Abstract — Opportunistic spectrum access (OSA) is a key technique
enabling the secondary users (SUs) in a cognitive radio
(CR) network to transmit over the "spectrum holes" unoccupied 
by the primary users (PUs). In this paper, we focus on the 
OSA design in the presence of reactive PUs, where PU's access 
probability in a given channel is related to SU's past access 
decisions. We model the channel occupancy of the reactive 
PU as a 4-state discrete-time Markov chain. We formulate 
the optimal OSA design for SU throughput maximization as a 
constrained finite-horizon partially observable Markov decision 
process (POMDP) problem. We solve this problem by first 
considering the conventional short-term conditional collision 
probability (SCCP) constraint. We then adopt a long-term PU 
throughput (LPUT) constraint to effectively protect the reactive 
PU transmission. We derive the structure of the optimal OSA 
policy under the LPUT constraint and propose a suboptimal 
policy with lower complexity. Numerical results are provided 
to validate the proposed studies, which reveal some interesting 
new tradeoffs between SU throughput maximization and PU 
transmission protection in a practical interaction scenario. 

Index Terms — Opportunistic spectrum access, reactive pri- 
mary user, cognitive radio, partially observable Markov decision 
process (POMDP), dynamic programming. 



I. Introduction 

By enabling the secondary users (SUs) to access the un- 
occupied channels of the primary users (PUs) in a cognitive 
radio (CR) network, opportunistic spectrum access (OSA) is 
regarded as one promising solution to resolving the spectrum 
scarcity versus spectrum underutilization paradox in wireless 
communications [1]-[3]. To design optimal OSA strategies,
two competing goals are addressed at the same time: the 
"spectrum holes" unused by the PUs should be optimally ex- 
plored by the SUs to maximize their throughput, whereas the 
probability of the SU's transmission collision with undetected 
active PUs should be minimized. In this paper, we study the 
OSA design for SUs in the presence of reactive PUs and 
aim at achieving the optimal tradeoffs between SU throughput 
maximization and PU collision minimization. 

A. Related Work 

A great deal of valuable prior work has investigated the 
OSA design for CR networks. Assuming that SU is only able 
to sense a certain part of the spectrum at each time due to 
hardware limitations, the authors in [4] proposed a partially
observable Markov decision process (POMDP) framework to
design the optimal OSA. However, due to "the curse of dimen- 
sionality", POMDP problems are in general computationally 
prohibitive to solve [5], [6]. Zhao et al. in [7] formulated
the design of SU's optimal sensing policy as a POMDP and 
proposed a myopic sensing policy, which maximizes SU's 
average reward over a finite horizon. In [8], Chen et al.
proposed a threshold-based optimal spectrum sensing and 
accessing policy, which maximizes SU's throughput during its 
battery lifetime. Both [7] and [8] assumed that sensing errors
are negligible. In [9], Chen et al. considered OSA design in
the presence of sensing errors and proposed a short-term conditional
collision probability (SCCP) constraint for protecting
PUs from SU's collisions in a time-slotted primary system.
Moreover, [9] proposed a separation principle to significantly
reduce the complexity of solving the constrained finite-horizon
POMDP problem, which maximizes SU's throughput subject
to the SCCP constraint. Since the SCCP constraint is able
to provide effective protection to PUs' transmission [9], it
has been widely adopted in subsequent studies on OSA. For
example, a similar POMDP problem subject to the SCCP
constraint was considered in [10] for unslotted PU systems.
An online OSA algorithm that learns the PU's signal statistics
was proposed in [11] under the SCCP constraint. Li et al.
in [12] showed that when the SCCP constraints over time are
tight, the optimal OSA policy can be implemented as a simple
memoryless policy with periodic channel sensing.

Most existing work on OSA with time-slotted PUs, includ- 
ing the aforementioned one, has assumed a non-reactive PU 
model, where PU's transmission over a particular channel 
evolves as a 2-state on/off Markov chain with fixed state 
transition probabilities. Similar assumptions can also be found 
in the experimental based work on OSA with unslotted PUs, 
such as [13] and [14]. Although greatly simplifying the OSA
design, the non-reactive PU model might not be practical 
since existing wireless systems are mostly intelligent enough 
to adapt their transmissions upon experiencing collision or 
interference. For example, a PU may increase transmit power
to compensate for the link loss due to the received interference.
Alternatively, it may reduce the channel access probability 
when collision occurs in a carrier sensing multiple access 
(CSMA) based primary system. In this paper, we refer to such 
PUs as reactive PUs, to differentiate from their non-reactive 
counterparts. 

In this paper, we focus on designing SU's optimal OSA 
policy in the presence of time-slotted reactive PUs. It is 
worth noting that there has been recent work that addressed 
reactive PUs for OSA and/or spectrum sharing (SS) based 
CR networks. In contrast to OSA, with SS, SU is allowed
to transmit regardless of the PU's on/off status, provided that
the resulting interference to PU is kept below a predefined
threshold. In [15], the authors proposed a hidden power-feedback
loop for the CR: If PU is reactive and reacts upon receiving
SU's interference, SU will receive a power-boosted PU
signal that is easier to detect. Following [15], [16] proposed
a proactive sensing scheme and a sequential transmit power
adaptation strategy to exploit spectrum opportunities in the SS
based CR. In [17], the author extended the work in [15] and
designed active learning and supervised transmission schemes.
Automatic retransmission request (ARQ) based reactive PUs
have been considered in, e.g., [19] and [20], for the SS based
CR. Under the assumption that SU has full knowledge of
PU's buffer state and ARQ state, the authors in [19] adopted
a Markov process based model to determine SU's optimal
transmission policy over an infinite horizon, which maximizes
SU's long-term average throughput subject to a constraint on
PU's long-term throughput loss. As shown in [19], the SU's optimal
transmission policy is stationary and thus can be obtained by
solving a linear program. Online algorithms have also been
proposed in [20] for the cases where only partial and/or noisy
observations of PU's buffer state and ARQ state are available
to the SU. Compared with the existing work for SS based CR, 
the work considering reactive PUs for OSA based CR is very 
limited. It is noted that a CSMA-based reactive PU model
has been proposed in [18] to investigate the performance of
different SU access policies; however, [18] did not address
the optimal OSA design.



B. Main Results 

In this paper, we focus on the effects of SUs' channel 
access actions on the reactive PUs' transmission quality. Since 
the existence of the secondary network is usually oblivious 
to PUs, we assume that PUs only implement conventional 
techniques, such as energy detection, to detect the existence 
of interference/collision; thus, PUs are not able to differentiate 
the received interference/collision from other PUs and that 
from SUs. In addition, there might be other unexpected co- 
channel interference and noise at the primary receiver, which 
can also evoke reactions of PUs. It is assumed that the reactive 
PUs treat all the received interference/collision in the same 
way and react to it accordingly. 

We consider an OSA-based CR network, in which one 
SU transmits opportunistically over N orthogonal frequency 
bands, each of which is assigned to one PU. In each time- 
slot, the SU selects one channel to sense by choosing a 
spectrum sensor operating point, and then determines whether 
to access the selected channel based on the sensing result. To 
maximize the SU's throughput subject to PUs' transmission 
protection, we formulate the OSA design problem as a con- 
strained POMDP problem. The main results of this paper are 
summarized as follows. 

• We propose a new 4-state discrete-time Markov chain
model to describe the channel occupancy state of each reactive
PU, which includes the conventional 2-state on/off
model for the non-reactive PU as a special case. The
expanded state space and state transition probabilities in
the new model are used to specify the reactions of the PU
to collisions with the SU's transmissions.

• By adopting the conventional SCCP constraint to protect
PU's transmission as in [9], we study the optimal OSA
policy under the proposed reactive PU model. We extend
the separation principle proposed in [9] for the non-reactive
PU case to the reactive PU case, and obtain the
optimal OSA policy that can be implemented efficiently.
However, unlike the non-reactive PU case, we show
that the reactive PU's throughput in general cannot be
guaranteed under the SCCP constraint.

• To effectively protect the reactive PU's transmission, we
adopt a long-term PU throughput (LPUT) constraint,
similar to the one proposed for the SS based CR in [21].
Under this constraint, we first study the OSA design
for PU's worst case transmission with $N = 1$, i.e.,
there is only one pair of PU and SU sharing a single
channel. We obtain the optimal OSA policy structure
in this case, which reveals that the spectrum sensor
design plays a crucial role in effectively protecting PU's
transmission. Noticing the high complexity in designing
an effective spectrum sensor due to the non-deterministic
belief state transitions of the POMDP, we convert the
POMDP into an equivalent Markov decision process
(MDP) with deterministic state transitions. By studying
the reformulated MDP-based LPUT constraint, we propose
a suboptimal OSA policy with lower implementation
complexity, which is shown to guarantee the reactive
PU's throughput. Based on the separation principle, we
then extend the suboptimal policy for the case of $N = 1$
to the general case of $N > 1$ and show that the reactive
PU's throughput on each channel is guaranteed by the
proposed suboptimal policy.



C. Organization 

The rest of this paper is organized as follows. Section II 
presents the channel occupancy model for reactive PUs in a 
CR network. Section III formulates the OSA design under the 
reactive PU model as a constrained POMDP problem. Section 
IV studies the POMDP problem under the conventional SCCP 
constraint and develops the optimal OSA policy based on the 
separation principle. Section V studies the POMDP problem 
under the proposed LPUT constraint and proposes a suboptimal
policy. Section VI presents numerical examples comparing the
performance of the proposed optimal and suboptimal policies.
Finally, Section VII concludes the paper.

II. System Model 

We consider a CR network consisting of one SU and $N$
PUs. Each PU is preassigned a dedicated channel, and the
traffic carried by different channels is assumed to be mutually
independent. We assume synchronized time-slotted transmission
for all the PUs and the SU as in [4], [7], [8], [9], and [11].
In the following, we model the channel occupancy state of the
reactive PU and describe the corresponding OSA decisions of
the SU.

Fig. 1. Channel occupancy model for the non-reactive PU.

Fig. 2. Channel occupancy model for the reactive PU.



A. Channel Occupancy Model for Reactive PU 

Fig. 1 shows a typical example of the channel occupancy 
model for the conventional non-reactive PU, which has been 
adopted in prior work, e.g., lH], |@], ifTI- lfTZI . In this model, 
the primary traffic over a given channel is approximated by
a two-state discrete-time Markov chain with states '0' and
'1' denoting whether the channel is busy or idle, respectively.
The PU's state changes slot by slot according to the transition
probabilities $\alpha_0$ and $\beta_0$ shown in Fig. 1. Clearly, this model
cannot reflect the state of reactive PUs, for which the
state transition depends on the SU's past access decisions. For
example, a reactive PU usually reduces its channel access
probability if a collision occurs and increases this probability
when the environment becomes friendly again (no collision is
observed). To reflect the PU's reactive behaviors in practice, we
propose an enhanced channel occupancy model.

The new model is composed of two levels, namely, Level 0
and Level 1. The reactive PU is assumed to have a higher
probability of accessing the channel when it is in Level 0 than in
Level 1. As a result, the enhanced channel occupancy model
becomes a four-state Markov chain as shown in Fig. 2, with
each level having two states (busy or idle). For convenience,
we use 2 bits to represent the 4 states of the reactive
PU. The first bit denotes the level and the second bit denotes
whether the channel is busy ('0') or idle ('1').

In each time-slot, the state of a reactive PU evolves according
to its state and the SU's action in the previous slot.
Suppose that initially the PU is in Level 0 with transition
probabilities $\alpha_0$ and $\beta_0$, as shown in Fig. 1. If no SU
accesses the channel, or the SU accesses when the PU is in
state '01', no collision occurs and the PU stays in Level 0.
However, if the SU accesses when the PU's state is '00', the
PU reacts to the resulting collision by moving from Level 0
to Level 1, with probability $\alpha_1$ to state '11' and probability
$1 - \alpha_1$ to state '10'. We assume $\alpha_1 > \alpha_0$ to reflect the lower
probability that the reactive PU accesses the channel in Level
1 than in Level 0. Once the PU has moved to Level 1, it
stays in this level if the SU continues to access the channel.
However, if the SU does not access the channel and the PU is
in state '10', the PU observes no collision and thus infers
that the environment has become friendly for its transmission.
As a result, the PU increases its probability of accessing the
channel by returning to Level 0, with transition probabilities
$\alpha_0$ to state '01' and $1 - \alpha_0$ to state '00', respectively. Since the
reactive PU's state transitions are related to the SU's actions in
state '00' or '10', the corresponding transition probabilities are
conditioned on the SU's action as shown in Fig. 2. Moreover,
notice that when the PU's state is '01' or '11', the state
transition probabilities are not affected by the SU's actions.
This is because no collision occurs if the PU does not attempt
to transmit. We assume $\beta_1 > \beta_0$ to be consistent with the
reactive PU's greater willingness to access in Level 0 than in
Level 1. Note that when $\alpha_1 = \alpha_0$ and $\beta_1 = \beta_0$, the proposed
4-state channel occupancy model for the reactive PU reduces
to the conventional 2-state counterpart in Fig. 1 for the
non-reactive PU.

It is worth pointing out that our proposed two-level Markov 
chain model is a basic model that captures essential reactions 
of the PU subject to the SU's collisions; and therefore it 
can be generalized to specify more complicated reactions 
of the PU (e.g., random transmission backoff in CSMA) by 
appropriately setting the transition probabilities in each level 
and/or increasing the number of levels in the model. 
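
The two-level chain described above can be sketched in a few lines of Python. This is only an illustrative reading of Fig. 2, not the authors' code: the function names are hypothetical, and it is assumed that $\beta_0$ and $\beta_1$ are the probabilities of remaining idle in Level 0 and Level 1, respectively, and that the SU accesses blindly with a fixed probability.

```python
import random

# 2-bit state labels from the text: first bit = level, second bit = busy (0) / idle (1).
S00, S01, S10, S11 = 0, 1, 2, 3


def next_state_distribution(state, su_access, a0, b0, a1, b1):
    """Return [P(00), P(01), P(10), P(11)] for the next slot."""
    if state in (S00, S10):      # PU busy: the SU's action decides the next level
        if su_access:            # collision, so the PU moves to (or stays in) Level 1
            return [0.0, 0.0, 1.0 - a1, a1]
        return [1.0 - a0, a0, 0.0, 0.0]   # no collision, so the PU is in Level 0
    if state == S01:             # idle in Level 0 (assumed: stays idle w.p. b0)
        return [1.0 - b0, b0, 0.0, 0.0]
    return [0.0, 0.0, 1.0 - b1, b1]       # idle in Level 1 (assumed: stays idle w.p. b1)


def simulate(num_slots, su_access_prob, params, seed=0):
    """Simulate the PU state path when the SU accesses blindly with a fixed probability."""
    rng = random.Random(seed)
    state, path = S00, []
    for _ in range(num_slots):
        path.append(state)
        su_access = rng.random() < su_access_prob
        dist = next_state_distribution(state, su_access, *params)
        state = rng.choices([S00, S01, S10, S11], weights=dist)[0]
    return path
```

With `su_access_prob = 0` the simulated PU never leaves Level 0, and with $\alpha_1 = \alpha_0$, $\beta_1 = \beta_0$ the idle probability no longer depends on the level, which matches the reduction to the 2-state model noted above.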

B. SU's OSA 

We assume that the SU can only select one channel for
sensing in each time-slot due to hardware limitations, and
the sensing result over the selected channel may differ from the
PU's actual state due to sensing errors. Similar to [9], the
SU makes a sequence of decisions in each slot as follows.
At the beginning of slot $t$, $t \geq 1$, the SU transmitter selects
a channel $a(t) \in \mathcal{A}_s$ to sense, where $\mathcal{A}_s = \{1, 2, \ldots, N\}$
denotes the set of channels. Supposing $a(t) = a$, the SU
then decides the sensor operating point to sense channel $a$,
which is determined by the probability of false alarm (PFA)
$\epsilon_a(t) \in [0,1]$ and the probability of mis-detection (PM)
$\delta_a(t) \in [0,1]$. A feasible operating point must be confined
by the optimal receiver operating characteristic (ROC) curve$^1$
and the line determined by $1 - \delta_a(t) = \epsilon_a(t)$. The set of all
feasible operating points is denoted by $\mathcal{A}_\delta(a(t))$. Based on the
sensing result $\Theta_a(t) \in \{0, 1\}$, the SU decides a pair of access
probabilities $(f_a(0,t), f_a(1,t)) \in [0,1]^2$ for this channel,
where $f_a(\theta,t)$ is the access probability on channel $a$ in slot $t$
with $\Theta_a(t) = \theta$. Denoting $\Phi_a(t) \in \{0 \text{ (not access)}, 1 \text{ (access)}\}$
as the SU's access action on channel $a$ in slot $t$, $f_a(\theta,t)$ is
expressed by the following conditional probability



$$ f_a(\theta,t) \triangleq P\{\Phi_a(t) = 1 \mid \Theta_a(t) = \theta\}. \tag{1} $$



At the end of slot $t$, the SU transmitter receives error-free
feedback $K_a(t) \in \{0,1\}$ from the SU receiver, where
$K_a(t) = 1$ means that the SU's information is transmitted
$^1$Given the maximum allowable PFA $\epsilon$, the smallest achievable probability
of mis-detection, denoted by $\delta^*$, can be attained by the optimal
Neyman-Pearson detector [22]. By varying $\epsilon$ over $[0, 1]$, the resultant $\delta^*$ and $\epsilon$ pairs
form the optimal ROC curve.



successfully, and $K_a(t) = 0$ means that the SU transmits
but the SU receiver fails to receive the transmitted information
because the PU is busy and hence a collision occurs.
Note that if the SU does not transmit, the SU transmitter
receives no feedback. For ease of representation, we
assume that this case is also represented by $K_a(t) = 0$. Note
that $K_a(t)$ is used by the SU transmitter and receiver to maintain
their decision synchronization [9].

III. OSA Design in Partially Observable 
Environments under Reactive PU Model 

As described in Section II-B, in each time-slot, the SU
selects one channel for sensing and thus is unable to observe
the PUs' states in the other $N - 1$ channels. Even for the case
of $N = 1$, the SU may not be able to obtain the PU's actual
state due to sensing errors. As a result, the PUs' states are
only partially observable at the SU over time. Thus, we adopt
a POMDP model to design the SU's OSA. In this section, we
describe the POMDP and formulate the SU's optimal OSA
design as a constrained POMDP problem.

A. POMDP Elements 

A POMDP in general consists of the following elements
[5]: a set of time-slots $\{1, \ldots, T\}$, where $T$ is called the
horizon, and a set of system states (with transition probabilities),
actions, observations (with observation probabilities)
and rewards, for each of the time-slots. In this subsection, we
formulate the POMDP model for the SU's OSA by specifying
these elements.

Specifically, we consider a finite-horizon POMDP with
$T < \infty$. Each system state in the POMDP is denoted by
an $N$-element vector, with each element representing one
PU's state on its assigned channel. For brevity, we represent
the states in Fig. 2, namely, 00, 01, 10, 11, using 0, 1, 2, 3,
respectively, and denote $\mathcal{C}_s = \{0,1,2,3\}$ as the set of
the states. Since $|\mathcal{C}_s| = 4$, there are in total $4^N$ POMDP
system states. Denote the action space of the POMDP as
$\mathcal{A} = \{(a(t), (\epsilon_a(t), \delta_a(t)), (f_a(0,t), f_a(1,t))) : a(t) \in \mathcal{A}_s, (\epsilon_a(t), \delta_a(t)) \in \mathcal{A}_\delta(a(t)), (f_a(0,t), f_a(1,t)) \in [0,1]^2\}$.
Let the observation in the POMDP be $K_a(t) \in \{0,1\}$ in
slot $t$. Suppose that $A(t) = A$, where $A = (a(t) = a, (\epsilon_a(t), \delta_a(t)), (f_a(0,t), f_a(1,t)))$.
The observation probability
is then denoted as $U_A(k|i) \triangleq P\{K_a(t) = k \mid i, A\}$ with
$k \in \{0, 1\}$, which represents the conditional probability of
observing $K_a(t) = k$ given that the SU's action is $A$ and
the PU's state over the selected channel $a$ is $i$, $i \in \mathcal{C}_s$. Let
$I(a,t) = 0$ and $I(a,t) = 1$ represent channel $a$ being busy
and idle in slot $t$, respectively. Denote $1_{[x]}$ as an indicator
function, which equals 1 if $x$ is true and 0 otherwise. Note
that $K_a(t) = I(a,t)\,\Phi_a(t)$. By applying a derivation similar to
that in [9], we obtain the following result:



$$ U_A(k|i) = \begin{cases} 1_{[I(a,t)=1]}\, g_a(t), & \text{if } k = 1, \\ 1 - U_A(k=1|i), & \text{if } k = 0, \end{cases} \tag{2} $$

where

\begin{align}
g_a(t) &\triangleq P\{\Phi_a(t) = 1 \mid a(t) = a, I(a,t) = 1\} \tag{3} \\
&= \epsilon_a(t) \times f_a(0,t) + (1 - \epsilon_a(t)) \times f_a(1,t) \tag{4}
\end{align}

is the SU's conditional access probability on channel $a$, given
that the PU is idle on this channel in slot $t$ and the SU
selects channel $a$ to sense. If the SU transmits successfully
in slot $t$, i.e., the observation $K_a(t) = 1$, it will obtain a unit
throughput. Denote the instantaneous reward of the SU in slot
$t$ by $R_s(t)$. By assuming a unit bandwidth ($B = 1$) for each
channel, we have

$$ R_s(t) \triangleq K_a(t) \times B = K_a(t). \tag{5} $$



At the end of each slot, the POMDP system moves to the
next state from the current state according to the POMDP
state transition probability. Since the PUs' states over different
channels evolve independently, the POMDP system state transition
probability is obtained as the product of each PU's state
transition probability. We thus focus on the state transition
probability for a given channel $n \in \mathcal{A}_s$. Let $i, j \in \mathcal{C}_s$ and
denote $P_n(i|j,A)$ as the transition probability from state $j$ in
slot $t$ to state $i$ in slot $t+1$ over channel $n$ under the SU's
action $A = (a(t) = a, (\epsilon_a(t), \delta_a(t)), (f_a(0,t), f_a(1,t)))$. If
$n \neq a$, the SU does not select channel $n$ for sensing. Fig. 3(a)
shows the state transition probabilities for this case, where
we use superscript $n$ to denote channel $n$. These probabilities
are easily obtained from Fig. 2 with the SU's access action
being "not access". If $n = a$, the SU selects channel $a$
to sense (and possibly access). Fig. 3(b) shows the state
transition probabilities for this case. If the state is '01' or
'11', the transition probabilities are independent of the SU's
action; otherwise, according to Fig. 2, they are subject to the
SU's access action $\Phi_a(t)$. Note that given the PU's state on
channel $a$, $\Phi_a(t)$ is determined in probability by $(\epsilon_a(t), \delta_a(t))$
and $(f_a(0,t), f_a(1,t))$. Based on Fig. 2, with the fact that
$P_a(i|j=2,A) = P_a(i|j=0,A)$, $\forall i \in \mathcal{C}_s$, we thus obtain
the following transition probabilities:

$$ \begin{cases}
P_a(i=0|j=0,A) = P_a(i=0|j=2,A) = (1-\mu_a(t)) \times (1-\alpha_0), \\
P_a(i=1|j=0,A) = P_a(i=1|j=2,A) = (1-\mu_a(t)) \times \alpha_0, \\
P_a(i=2|j=0,A) = P_a(i=2|j=2,A) = \mu_a(t) \times (1-\alpha_1), \\
P_a(i=3|j=0,A) = P_a(i=3|j=2,A) = \mu_a(t) \times \alpha_1,
\end{cases} $$

where

\begin{align}
\mu_a(t) &\triangleq P\{\Phi_a(t) = 1 \mid a(t) = a, I(a,t) = 0\} \tag{6} \\
&= (1 - \delta_a(t)) \times f_a(0,t) + \delta_a(t) \times f_a(1,t) \tag{7}
\end{align}

is the SU's conditional access probability on channel $a$, given
that the PU is busy on this channel in slot $t$ and the SU selects
channel $a$ to sense.
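
As a quick numerical check of (4), (7), and the busy-state transition probabilities above, the following Python sketch computes $g_a(t)$, $\mu_a(t)$, and the resulting transition row. The function names are illustrative and do not come from the paper.

```python
def access_probabilities(eps, delta, f0, f1):
    """Return (g, mu): access probability given idle, as in (4), and given busy, as in (7)."""
    g = eps * f0 + (1.0 - eps) * f1
    mu = (1.0 - delta) * f0 + delta * f1
    return g, mu


def busy_state_transition(mu, a0, a1):
    """Next-state probabilities [P(00), P(01), P(10), P(11)] from state '00' or '10'
    on the sensed channel, given the conditional access probability mu."""
    return [
        (1.0 - mu) * (1.0 - a0),  # no collision, PU busy again in Level 0
        (1.0 - mu) * a0,          # no collision, PU idle in Level 0
        mu * (1.0 - a1),          # collision, PU busy in Level 1
        mu * a1,                  # collision, PU idle in Level 1
    ]
```

For example, `access_probabilities(0.5, 0.5, 0.0, 0.5)` returns $g = \mu = 0.25$, and setting `mu = 0` recovers the transition probabilities of an unsensed channel.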

Fig. 3. State transition probabilities in the reactive PU model: (a) shows
the case when $n \neq a$, and (b) shows the case when $n = a$.

B. Belief on POMDP States

In the POMDP model, the system states are not exactly
known at the SU. However, based on the SU's previous actions
and observations, a belief on the POMDP system state can be
obtained. The belief is defined as the conditional probability
distribution over all possible POMDP system states given the
history of the SU's actions and observations. As shown in
[5], the belief on the POMDP system state is a sufficient
statistic for the design of optimal actions. For our model,
the POMDP system state consists of the PUs' states over
independent channels. The SU's belief on the POMDP system
state is thus given by the belief on the PU's 4 possible states
over $N$ channels. Hence, we adopt a $4 \times N$ matrix
$\Lambda(t) = [\lambda_{nj}(t)]_{n \in \mathcal{A}_s, j \in \mathcal{C}_s}$
as the belief state of the POMDP, where
the element $\lambda_{nj}(t)$ represents the conditional probability that
the state of channel $n \in \mathcal{A}_s$ is $j \in \mathcal{C}_s$ in slot $t$, given the SU's
decision and observation history. We have $\sum_{j \in \mathcal{C}_s} \lambda_{nj}(t) = 1$
for $t \in \{1, \ldots, T\}$. Clearly, the space of the POMDP belief
states is $[0, 1]^{4 \times N}$. The belief state is updated slot by slot
based on the SU's previous actions and observations. Suppose
that channel $a$ is selected in slot $t$. The belief on states in slot
$t + 1$ is updated as follows.

• If $n \neq a$, the updated belief on the state is not affected
by the SU's action. We thus have

$$ \lambda_{nj}(t+1) = \sum_{i \in \mathcal{C}_s} \lambda_{ni}(t) P_n(j|i,A), \quad \forall j \in \mathcal{C}_s. \tag{8} $$

• If $n = a$, the updated belief on the state is related to the
SU's action and is obtained according to the observation
$K_a(t) = k$ via Bayes rule [5] as

$$ \lambda_{aj}(t+1) = \frac{\sum_{i \in \mathcal{C}_s} \lambda_{ai}(t) P_a(j|i,A) U_A(k|i)}{\sum_{i \in \mathcal{C}_s} \lambda_{ai}(t) U_A(k|i)}, \quad \forall j \in \mathcal{C}_s. \tag{9} $$

C. Policy Description

The OSA design for the SU is given by a sensing policy $\pi_s$,
a sensor operating policy $\pi_\delta$ and an access policy $\pi_c$. Specifically,
the sensing policy specifies a sequence of functions as
$\pi_s = \{d_1^s, d_2^s, \ldots, d_T^s\}$, where $d_t^s$ in slot $t$ maps a belief state
$\Lambda(t)$ to the channel $a(t) \in \mathcal{A}_s$ selected to sense for this slot.
Given the selected channel $a(t) = a$, the sensor operating policy
specifies a sequence of functions as $\pi_\delta = \{d_1^\delta, d_2^\delta, \ldots, d_T^\delta\}$,
where $d_t^\delta$ in slot $t$ maps $\Lambda(t)$ to a feasible sensor operating
point $(\epsilon_a(t), \delta_a(t)) \in \mathcal{A}_\delta(a(t))$ for this slot. Similarly, the
access policy is specified as $\pi_c = \{d_1^c, d_2^c, \ldots, d_T^c\}$. Given the
sensing result $\Theta_a(t) \in \{0, 1\}$ in slot $t$, $d_t^c$ maps $\Lambda(t)$ to the
access probabilities $(f_a(0,t), f_a(1,t)) \in [0,1]^2$ in slot $t$.

D. Constrained POMDP Problem

The optimal OSA design $\{\pi_s^*, \pi_\delta^*, \pi_c^*\}$ is obtained by
solving a constrained POMDP problem, which maximizes
the SU's expected reward over $T$ slots subject to various
constraints to protect the PUs' transmission. Specifically, the
objective of the POMDP problem is to obtain

$$ \{\pi_s^*, \pi_\delta^*, \pi_c^*\} = \arg\max_{\pi_s, \pi_\delta, \pi_c} E_{\{\pi_s, \pi_\delta, \pi_c\}}\Big\{\sum_{t=1}^{T} R_s(t) \,\Big|\, \Lambda(1)\Big\}, \tag{10} $$

where $\Lambda(1)$ is the initial belief state in slot $t = 1$. The
elements in $\Lambda(1)$ are set according to the stationary distribution
of the underlying Markov chain, under the assumption
that the PU is at Level 0 initially, i.e., $\lambda_{n0}(1) = \frac{1-\beta_0}{1+\alpha_0-\beta_0}$,
$\lambda_{n1}(1) = 1 - \lambda_{n0}(1)$, and $\lambda_{n2}(1) = \lambda_{n3}(1) = 0$, $\forall n \in \mathcal{A}_s$.

Suppose that the SU selects channel $a$ in slot $t$. Given $\Lambda(t)$,
from (5), the SU's expected reward in slot $t$ over all possible
PU's states on channel $a$ and the SU's observations is obtained
as

$$ E_{\{\pi_s, \pi_\delta, \pi_c\}}\{R_s(t) \mid \Lambda(t)\} = (\lambda_{a1}(t) + \lambda_{a3}(t)) \times g_a(t). \tag{11} $$

We consider two types of protection methods for the reactive
PUs, namely, the short-term conditional collision probability
(SCCP) constraint and the long-term PU throughput
(LPUT) constraint, for which the detailed formulations will
be given in Section IV and Section V, respectively.
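
The belief updates in (8) and (9) and the expected reward in (11) can be sketched for a single channel as follows. This is a minimal illustration rather than the authors' implementation: the helper names are hypothetical, beliefs are plain 4-element lists ordered as states 0 to 3, and the idle-state rows assume that $\beta_0$ and $\beta_1$ are the probabilities of remaining idle in Level 0 and Level 1.

```python
def transition_matrix(mu, a0, b0, a1, b1):
    """P[i][j] = probability of moving from state i to state j, given the
    SU's conditional access probability mu when the PU is busy."""
    busy = [(1 - mu) * (1 - a0), (1 - mu) * a0, mu * (1 - a1), mu * a1]
    return [busy, [1 - b0, b0, 0.0, 0.0], list(busy), [0.0, 0.0, 1 - b1, b1]]


def observation_prob(k, state, g):
    """U_A(k|i) from (2): K = 1 requires an idle PU (state 1 or 3) and an SU access."""
    p_success = g if state in (1, 3) else 0.0
    return p_success if k == 1 else 1.0 - p_success


def update_unselected(belief, params):
    """Eq. (8): the channel is not sensed, so the SU does not access it (mu = 0)."""
    P = transition_matrix(0.0, *params)
    return [sum(belief[i] * P[i][j] for i in range(4)) for j in range(4)]


def update_selected(belief, k, g, mu, params):
    """Eq. (9): Bayes update on the sensed channel after observing K = k."""
    P = transition_matrix(mu, *params)
    weights = [belief[i] * observation_prob(k, i, g) for i in range(4)]
    norm = sum(weights)
    if norm == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [sum(weights[i] * P[i][j] for i in range(4)) / norm for j in range(4)]


def expected_reward(belief, g):
    """Eq. (11): probability that the PU is idle times the access probability."""
    return (belief[1] + belief[3]) * g
```

Starting from the Level 0 stationary belief `[0.5, 0.5, 0, 0]` with $\alpha_0 = \beta_0 = 0.5$, a successful transmission ($k = 1$) tells the SU the PU was idle in Level 0, so the updated belief is simply the row of the transition matrix for state '01'.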

IV. OSA Design under SCCP Constraint 

The SCCP constraint has been widely adopted in the
literature, e.g., [9]-[12], to protect the PU's transmission by
imposing a conditional collision probability constraint $\zeta$ on
channel $n \in \mathcal{A}_s$, and is defined as

$$ \sigma_n(t) \triangleq P\{\Phi_n(t) = 1 \mid I(n,t) = 0\} \leq \zeta, \quad \forall n \in \mathcal{A}_s, \forall t \in \{1, \ldots, T\}. \tag{12} $$

The SCCP constraint ensures that on every channel, the PU
experiences collisions from the SU for no more than a fraction $\zeta$
of its transmission time. Thus, the PU's throughput under the
SU's OSA is at least $100 \times (1 - \zeta)$ percent of that without
the presence of the SU, if the PU is non-reactive. However, the
effectiveness of the SCCP constraint in protecting the reactive
PU's transmission has not yet been addressed in the literature. In
this section, we adopt the conventional SCCP constraint and
design the optimal OSA policy for the SU under the reactive
PU model. By adopting the optimal OSA policy, we show that
the SCCP constraint is not able to provide effective protection
to the reactive PU's transmission.



TABLE I

The non-monotonicity of $Q_t(\Lambda(t)|A)$ with respect to $g_a(t)$ under the reactive PU model
(with $\alpha_0 = 0.5$, $\beta_0 = 0.5$, $\alpha_1 = 0.9$, $\beta_1 = 0.9$).

Cases    $f_a(0,1)$   $f_a(1,1)$   $\epsilon_a(1)$   $\delta_a(1)$   $g_a(1)$   $Q_1(\Lambda(1)|A)$
case 1   0            0.5          0.5               0.5             0.25       0.675
case 2   0            0.6          0.5               0.5             0.3        0.71
case 3   0            0.6          0.5               0.1             0.3        0.662



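Suppose $a(t) = a$. With a derivation similar to that in [9],
we obtain from (12) that

$$ \sigma_n(t) = \begin{cases} (1 - \delta_a(t)) \times f_a(0,t) + \delta_a(t) \times f_a(1,t), & \text{if } n = a, \\ 0, & \text{if } n \neq a. \end{cases} \tag{13} $$

Note that if $n = a$, $\sigma_a(t)$ has the same expression as $\mu_a(t)$
in (7). Using (10) and (13), the OSA design for the SU under
the SCCP constraint is formulated as

$$ \text{(P1):} \quad \max_{\pi_s, \pi_\delta, \pi_c} \; E_{\{\pi_s, \pi_\delta, \pi_c\}}\Big\{\sum_{t=1}^{T} R_s(t) \,\Big|\, \Lambda(1)\Big\} \quad \text{s.t.} \quad \sigma_n(t) \leq \zeta, \; \forall n \in \mathcal{A}_s, \forall t \in \{1, \ldots, T\}. $$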



According to the principle of dynamic programming,
(P1) can be decoupled into $T$ subproblems without loss
of optimality. Each subproblem is to find a value function
$V_t(\Lambda(t))$, $1 \leq t \leq T$, which represents the SU's maximum
expected reward that can be obtained from slot $t$ to slot $T$
under the SCCP constraint, given the belief state $\Lambda(t)$. Given
the SU's action $A(t) = A$ and the belief state $\Lambda(t)$, the reward
in slot $t$, $1 \leq t \leq T - 1$, consists of two parts: the SU's
expected immediate reward $E_A\{R_s(t) \mid \Lambda(t)\}$ and the SU's
maximum expected future reward $E_A\{V_{t+1}(\Lambda(t+1)) \mid \Lambda(t)\}$
over all possible PU's states on channel $a$ and the SU's
observations in slot $t$, where $\Lambda(t+1)$ is updated from
$\Lambda(t)$ according to (8) and (9). In the last slot $t = T$, the
SU's expected immediate reward alone is the value function
$V_T(\Lambda(T))$. By averaging over all possible PUs' states and
the SU's observations $K_a(t) = k \in \{0,1\}$ and maximizing
over all SU's actions $A \in \mathcal{A}$, we obtain the value functions
expressed as

$$ V_t(\Lambda(t)) = \max_{A \in \mathcal{A}} \sum_{k=0}^{1} \sum_{i=0}^{3} \lambda_{ai}(t)\, U_A(k|i) \left[k + V_{t+1}(\Lambda(t+1))\right], \quad 1 \leq t \leq T-1, \tag{14} $$

$$ V_T(\Lambda(T)) = \max_{A \in \mathcal{A}} \sum_{k=0}^{1} \sum_{i=0}^{3} \lambda_{ai}(T)\, U_A(k|i) \times k, \quad t = T. \tag{15} $$

By computing the value functions given in (14) and (15)
recursively backward in time and searching over all possible
actions $A(t) \in \mathcal{A}$ in each slot, we can find the optimal
policy $\{\pi_s^*, \pi_\delta^*, \pi_c^*\}$ for (P1) that maximizes the SU's expected
reward over $T$ slots, i.e., $V_1(\Lambda(1))$ in (14), under the SCCP
constraint. However, (14) and (15) are generally intractable
and computationally prohibitive due to the infinite and uncountable
action space $\mathcal{A}$.



A. Optimal OSA Policy Based on Separation Principle 

In [9], a separation principle was proposed to obtain the
optimal policy for an OSA design problem similar to (P1),
but under the non-reactive PU model. It is shown in [9]
that with this method, the sensing policy can be designed
separately from the sensor operating policy and the access
policy. The optimal sensor operating policy $\pi_\delta^*$ and the optimal
access policy $\pi_c^*$ over any channel $a$ selected for sensing are
obtained by maximizing $g_a(t)$ given in (4), subject to the
SCCP constraint in slot $t$. Since the action space $\mathcal{A}_s$ is
finite and countable, the optimal sensing policy $\pi_s^*$ is then
obtained by standard dynamic programming techniques, given
$\pi_\delta^*$ and $\pi_c^*$.

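For ease of presentation, we define $Q_t(\Lambda(t)|A)$ as the
SU's maximum expected reward that can be obtained from
slot $t$ to slot $T$, given the SU's action $A \in \mathcal{A}$ in slot $t$ and the
belief state $\Lambda(t)$, i.e.,

$$ Q_t(\Lambda(t)|A) = \sum_{i=0}^{3} \sum_{k=0}^{1} \lambda_{ai}(t)\, U_A(k|i) \left[k + V_{t+1}(\Lambda(t+1))\right], \quad 1 \leq t \leq T-1, \tag{16} $$

$$ Q_T(\Lambda(T)|A) = \sum_{i=0}^{3} \sum_{k=0}^{1} \lambda_{ai}(T)\, U_A(k|i) \times k, \quad t = T. \tag{17} $$

Then we have

$$ A^*(t) = \arg\max_{A \in \mathcal{A}} Q_t(\Lambda(t)|A), \quad 1 \leq t \leq T. \tag{18} $$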



As shown in [9], the main reason why the separation
principle holds under the non-reactive PU model is that over
any selected channel $a$ at slot $t$, $Q_t(\Lambda(t)|A)$ strictly increases
with $g_a(t)$ given in (4). However, under the proposed reactive
PU model, since the PU's channel access probabilities in
the future slots depend on the SU's current action decision,
$Q_t(\Lambda(t)|A)$ is generally not monotonically increasing with
$g_a(t)$. A simple example with $T = 2$ and $N = 1$ is shown in
TABLE I to validate this observation. In TABLE I, we consider
three cases, where the SU has different sensor operating
points and spectrum access probabilities over channel $a$ in slot
$t = 1$. We compute $g_a(1)$ and $Q_1(\Lambda(1)|A)$ based on (4), (16),
and (17) in slot $t = 1$ and compare them under these cases. It
is shown that both $g_a(1)$ and $Q_1(\Lambda(1)|A)$ in case 2 are larger
than those in case 1; however, the SU's actions in case 3 give a
larger $g_a(1)$ but a smaller $Q_1(\Lambda(1)|A)$, compared to case 1.
Hence, $Q_t(\Lambda(t)|A)$ does not necessarily increase with $g_a(t)$.
As a result, the proof in [9] for the separation principle does
not apply to our problem under the reactive PU model.

Interestingly, as shown in the following theorem, a separation
principle similar to that in [9] is applicable under the
reactive PU model without any loss of optimality. This is
mainly because the SCCP constraint in (P1) depends only
on $\pi_\delta$ and $\pi_c$.

Theorem 4.1: The SU's optimal OSA policy for (P1) under
the reactive PU model and the SCCP constraint is obtained
by the following two steps:

• Step 1: Determine the optimal sensor operating policy
$\pi_\delta^*$ and the optimal access policy $\pi_c^*$. Specifically, in slot
$t$, supposing $a(t) = a$, the optimal policies $\pi_\delta^*$ and
$\pi_c^*$ are given by

$$\begin{cases} \delta_a^*(t) = \zeta, \\ \epsilon_a^*(t) \text{ is on the optimal ROC curve corresponding to } \zeta, \\ f_a^*(0,t) = 0, \\ f_a^*(1,t) = 1. \end{cases} \tag{19}$$

• Step 2: Apply the optimal policies $\pi_\delta^*$ and $\pi_c^*$ in Step 1
to obtain the optimal sensing policy $\pi_s^*$ by solving the
following unconstrained POMDP:

$$\pi_s^* = \arg\max_{\pi_s} E_{\pi_s}\Big\{\sum_{t=1}^{T} R_s(t)\,\Big|\,\Lambda(1), \pi_\delta^*, \pi_c^*\Big\}. \tag{20}$$

Proof: Please refer to Appendix A. ■

Remark 4.1: Since the optimal action decisions $\delta_a^*(t)$,
$\epsilon_a^*(t)$, $f_a^*(0,t)$, and $f_a^*(1,t)$ given in (19) are independent
of the sensing policy, the optimal sensing policy $\pi_s^*$ can
be designed separately (as shown in Step 2 of Theorem
4.1). Since all the optimal actions $\delta_a^*(t)$, $\epsilon_a^*(t)$, $f_a^*(0,t)$, and
$f_a^*(1,t)$ are time-invariant, the optimal policies $\pi_\delta^*$ and $\pi_c^*$
are independent of the belief states. With $f_a^*(0,t) = 0$ and
$f_a^*(1,t) = 1$, it follows that the SU always trusts the spectrum
sensing result even though there may exist sensing errors, i.e.,
the SU accesses channel $a$ in slot $t$ with probability 1 when
the sensing result is $\theta_a(t) = 1$, and with probability 0 when
the sensing result is $\theta_a(t) = 0$.

B. SCCP Constraint for Protecting Reactive PUs 

In this subsection, we show that the SCCP constraint is not 
sufficient to guarantee the PU's benchmark throughput under 
the reactive PU model. 

We first derive the PU's benchmark throughput on each
channel. Denote the PU's throughput on channel $n$ in slot $t$
by $R_{p,n}(t)$. If the PU on channel $n$ accesses the assigned
channel in slot $t$, $1 \le t \le T$, and transmits successfully, it
obtains a unit throughput. Given the current belief state
$\Lambda(t)$ and the SU's OSA policies $\pi_s$, $\pi_\delta$, and $\pi_c$, it is then
easy to obtain that the PU's expected throughput on channel
$n$ over all possible states in slot $t$ is

$$E_{\pi_s,\pi_\delta,\pi_c}\{R_{p,n}(t)\,|\,\Lambda(t)\} = P\{I(n,t)=0\} \times (1-\sigma_n(t)), \tag{21}$$

where $P\{I(n,t)=0\} = \lambda_{n,0}(t) + \lambda_{n,2}(t)$ is the PU's access
probability on channel $n$ in slot $t$, and $\sigma_n(t)$ is given in (13).
By summing the PU's throughput over $T$ slots and dividing
the sum by $T$, the PU's normalized throughput on channel $n$,
denoted by $R^{o}_{p,n}$, is given as

$$R^{o}_{p,n} \triangleq \frac{1}{T}\,E_{\pi_s,\pi_\delta,\pi_c}\Big\{\sum_{t=1}^{T} R_{p,n}(t)\,\Big|\,\Lambda(1)\Big\}, \quad \forall n \in \mathcal{A}_s. \tag{22}$$

Footnote: Note that by setting $\alpha_1^n = \alpha_0^n$ and $\beta_1^n = \beta_0^n$, $\forall n \in \mathcal{A}_s$, the proposed
proof for the separation principle also holds for the non-reactive PU model.

Note that under the non-reactive PU model, the PU's
channel access probability is independent of the SU's spectrum
access policy and thus remains the same in each slot.
Specifically, from the stationary distribution of the underlying
Markov chain, we have

$$P\{I(n,t)=0\} = \frac{1-\beta_0^n}{1+\alpha_0^n-\beta_0^n}$$

for all $t \in \{1,\ldots,T\}$ and $n \in \mathcal{A}_s$. Since $\sigma_n(t) \le \zeta$, from (21) we obtain
the PU's minimum achievable expected throughput in slot $t$ on
channel $n$ as $\frac{1-\beta_0^n}{1+\alpha_0^n-\beta_0^n} \times (1-\zeta)$, where $1-\zeta$ is the minimum
probability that the SU does not collide with the PU in slot
$t$ on channel $n$. Then, according to (22), we obtain the PU's
minimum achievable normalized throughput on channel $n$
under the non-reactive PU model as

$$\Gamma_n = \frac{1-\beta_0^n}{1+\alpha_0^n-\beta_0^n} \times (1-\zeta), \quad \forall n \in \mathcal{A}_s. \tag{23}$$

Taking $\Gamma_n$ as the benchmark throughput for the PU on channel
$n$, we say that the PU system is under effective protection if
$R^{o}_{p,n} \ge \Gamma_n$, $\forall n \in \mathcal{A}_s$, is guaranteed.
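The benchmark in (23) is a closed-form expression, so it can be checked numerically. The following is a minimal sketch, assuming the stationary access probability $(1-\beta_0^n)/(1+\alpha_0^n-\beta_0^n)$ stated above; the function names `stationary_access_probability` and `benchmark_throughput` are illustrative and do not appear in the paper.

```python
def stationary_access_probability(alpha0: float, beta0: float) -> float:
    """P{I(n,t)=0} under the non-reactive model: (1 - beta0) / (1 + alpha0 - beta0)."""
    return (1.0 - beta0) / (1.0 + alpha0 - beta0)


def benchmark_throughput(alpha0: float, beta0: float, zeta: float) -> float:
    """Benchmark throughput as in (23): access probability times (1 - zeta)."""
    return stationary_access_probability(alpha0, beta0) * (1.0 - zeta)


if __name__ == "__main__":
    # Single-channel parameter values quoted in Section VI.
    print(round(stationary_access_probability(0.1, 0.2), 3))  # 0.889
    print(round(benchmark_throughput(0.1, 0.2, 0.10), 3))     # 0.8
```

With $\alpha_0 = 0.1$, $\beta_0 = 0.2$, and $\zeta = 0.1$ this reproduces the access probability 0.889 and the benchmark 0.8 reported in Section VI. For $\zeta = 0.05$ the same expression evaluates to about 0.844, close to the 0.846 quoted there.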

Next, we show that under the SCCP constraint, the reactive
PU's normalized throughput is not always larger than or equal
to the benchmark throughput. We consider two cases: the
single-channel case with $N = 1$ and the multi-channel case
with $N > 1$.

For the case of $N = 1$ with $n = a$, since the SU always
selects channel $a$ to sense and possibly accesses it, $N = 1$
can be considered the worst case for the PU's transmission. The
following proposition shows that the SCCP constraint is not
able to provide effective protection for the PU's transmission with
$N = 1$.

Proposition 4.1: With the optimal OSA policy in (19), for
$N = 1$ and $T > 1$, the reactive PU's normalized throughput on
channel $a$ is strictly smaller than the benchmark throughput,
i.e., $R^{o}_{p,a} < \Gamma_a$.

Proof: Please refer to Appendix B. ■

For the case of $N > 1$, since the SU can select one of the
$N$ channels to sense, $R^{o}_{p,n}$, $\forall n \in \mathcal{A}_s$, will be at least equal
to that in the worst case with $N = 1$. However, under the
reactive PU model with $T > 1$, since $R^{o}_{p,a} < \Gamma_a$ for $N = 1$,
it is difficult to analyze whether the SCCP constraint is an
effective PU protection method for $N > 1$. As will be shown
later by simulations in Section VI, the SCCP constraint is not
sufficient to guarantee the benchmark throughput of all the $N$
reactive PUs when $N > 1$. Thus, the SCCP constraint is not
able to provide effective protection to the PU transmissions.

V. OSA Design under the LPUT Constraint 

The long-term PU throughput (LPUT) constraint has been
widely adopted in the SS-based CR systems to guarantee that
the PU's transmission quality is always above a predefined
threshold regardless of the PU's on/off status [19]-[21]. In
contrast, in the OSA-based CR systems, due to the simplicity
and effectiveness of the SCCP constraint in protecting the
non-reactive PU transmissions, the more complicated LPUT constraint
has not been used to protect PU transmissions, to the best of our
knowledge.

For the reactive PU transmissions, however, as shown
in Section IV-B, the traditional SCCP constraint cannot be
adopted as an effective protection method. In this section, we
adopt the LPUT constraint as the protection method, which is
formulated as

$$R^{o}_{p,n} \ge \Gamma_n, \quad \forall n \in \mathcal{A}_s, \tag{24}$$

where $R^{o}_{p,n}$ is defined in (22) and $\Gamma_n$ is given in (23). Clearly,
the LPUT constraint formulated in (24) is able to provide
effective protection to the reactive PU transmissions, if it is
satisfied by the SU's OSA policy. Different from the SCCP
constraint in (12), the LPUT constraint, as seen from (22), takes into
account the PU's reaction to the SU's collision in each slot.
By using (10) and (24), the OSA design under the LPUT
constraint is formulated as

$$\text{(P2)}: \quad \max_{\pi_s,\pi_\delta,\pi_c} \; E_{\pi_s,\pi_\delta,\pi_c}\Big\{\sum_{t=1}^{T} R_s(t)\,\Big|\,\Lambda(1)\Big\} \quad \text{s.t.} \quad R^{o}_{p,n} \ge \Gamma_n, \; \forall n \in \mathcal{A}_s.$$



To the best of our knowledge, no existing work
addresses the finite-horizon long-term constrained POMDP
problem (P2). As the problem has an infinite and uncountable
action space, it is challenging to find the optimal policy
$\{\pi_s^*, \pi_\delta^*, \pi_c^*\}$ for it. Note that for (P1), we have proposed a
separation principle to design $\pi_s^*$ separately from $\pi_\delta^*$ and $\pi_c^*$,
since the SCCP constraint in (P1) is only related to $\pi_\delta^*$ and
$\pi_c^*$. However, the LPUT constraint in (P2) is determined by all
three policies. Thus, $\pi_s^*$ is generally not independent of $\pi_\delta^*$
and $\pi_c^*$, which implies that (P2) cannot be solved optimally by
the separation principle. However, a suboptimal policy for (P2)
can be found based on the separation principle. In this section,
we first focus on the single-channel case with $N = 1$ for (P2),
where only $\pi_\delta^*$ and $\pi_c^*$ need to be determined. We then propose
the suboptimal policy for the general multi-channel case with
$N > 1$ by extending the results from $N = 1$ to $N > 1$ based
on the separation principle.

A. Single-Channel Case: Optimal OSA Policy Structure 
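For (P2) in the single-channel case with $n = a$, we have

$$\text{(P2-S)}: \quad \max_{\pi_\delta,\pi_c} \; E_{\pi_\delta,\pi_c}\Big\{\sum_{t=1}^{T} R_s(t)\,\Big|\,\Lambda(1)\Big\} \quad \text{s.t.} \quad R^{o}_{p,a} \ge \Gamma_a,$$

with action space

$$\big\{\big((\epsilon_a(t),\delta_a(t)),(f_a(0,t),f_a(1,t))\big) : (\epsilon_a(t),\delta_a(t)) \in \mathcal{A}_\delta(a),\ (f_a(0,t),f_a(1,t)) \in [0,1]^2\big\}.$$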

To simplify (P2-S), in this subsection, we first propose 
an equivalent problem to (P2-S), namely, (P2-S-1), through 
which we find the optimal OSA policy structure. Based on 
the optimal OSA policy structure, we reduce (P2-S-1) to 
another problem (P2-S-2) with a significantly reduced action 
space. To facilitate our analysis, we first present the following 
proposition. 



Proposition 5.1: For the case of $N = 1$, given any
SU's OSA policy $\pi = \{\pi_\delta, \pi_c\}$, with the resultant PU's
normalized throughput $R^{o}_{p,a} > \Gamma_a$ and the SU's reward
$E_{\pi}\{\sum_{t=1}^{T} R_s(t)\,|\,\Lambda(1)\}$, we can find another policy $\pi' =
\{\pi'_\delta, \pi'_c\}$, with the resultant PU's normalized throughput
$R^{o\prime}_{p,a} = \Gamma_a$ and the SU's reward $E_{\pi'}\{\sum_{t=1}^{T} R_s(t)\,|\,\Lambda(1)\} \ge
E_{\pi}\{\sum_{t=1}^{T} R_s(t)\,|\,\Lambda(1)\}$.

Proof: Please refer to Appendix C. ■

According to Proposition 5.1, the optimal policy $\pi^* =
\{\pi_\delta^*, \pi_c^*\}$ is selected to ensure that $R^{o}_{p,a} = \Gamma_a$. Thus, (P2-S)
is equivalent to

$$\text{(P2-S-1)}: \quad \max_{\pi_\delta,\pi_c} \; E_{\pi_\delta,\pi_c}\Big\{\sum_{t=1}^{T} R_s(t)\,\Big|\,\Lambda(1)\Big\} \quad \text{s.t.} \quad R^{o}_{p,a} = \Gamma_a.$$

Proposition 5.2: The structure of the optimal policy $\pi^* =
\{\pi_\delta^*, \pi_c^*\}$ for (P2-S-1) is given as follows.

$$\begin{cases} \delta_a^*(t) \text{ is in general time-variant and to be determined,} \\ \epsilon_a^*(t) \text{ is on the optimal ROC curve corresponding to } \delta_a^*(t), \\ f_a^*(0,t) = 0, \\ f_a^*(1,t) = 1. \end{cases} \tag{25}$$

Proof: Please refer to Appendix D. ■

Remark 5.1: As shown in (25), the optimal spectrum access
policy $\pi_c^*$ for (P2-S-1) is the same as that for (P1), and the
optimal sensor operating point $(\delta_a^*(t), \epsilon_a^*(t))$ for (P2-S-1) is
on the optimal ROC curve, as is that for (P1). However, different
from the time-invariant case for (P1), the optimal PM decision
for (P2-S-1) is in general time-varying. As proved in Appendix
D, $\delta_a^*(t)$ for (P2-S-1) is related to the current belief state $\Lambda(t)$
and thus needs to be determined adaptively over time. This
indicates that the spectrum sensor design plays a crucial role in
protecting the reactive PU's transmission under the LPUT constraint.

By applying (22) and (25) and without loss of optimality,
(P2-S-1) is reduced to

$$\text{(P2-S-2)}: \quad \max_{\pi_\delta} \; E_{\pi_\delta}\Big\{\sum_{t=1}^{T} R_s(t)\,\Big|\,\Lambda(1), \pi_c^*\Big\} \quad \text{s.t.} \quad \frac{1}{T}\,E_{\pi_\delta}\Big\{\sum_{t=1}^{T} R_{p,a}(t)\,\Big|\,\Lambda(1), \pi_c^*\Big\} = \Gamma_a,$$

where $(\delta_a(t), \epsilon_a(t))$ determined by $\pi_\delta$ is on the optimal ROC
curve. Thus, to find the optimal policy $\pi_\delta^*$ for (P2-S-2), we
only need to search the action space $\{\delta_a(t) : \delta_a(t) \in [0,1]\}$,
which is greatly reduced compared with that of (P2-S-1).

Since (P2-S-2) is reduced from (P2-S-1) and (P2-S-1) is
equivalent to (P2-S), substituting the optimal $\pi_\delta^*$ for (P2-S-2)
into (25) yields the optimal OSA policy for (P2-S-1) and thus
(P2-S). Hence, in the following, we focus on solving (P2-S-2).

However, finding $\pi_\delta^*$ for (P2-S-2) is of high complexity,
mainly for two reasons: 1) the infinite
and uncountable action space of (P2-S-2), and 2) the
non-deterministic POMDP belief state transitions. As the complexity
due to the first reason is obvious, here we explain
the complexity due to the second. From (14) and
(15), to maximize the SU's throughput in (P2-S-2), i.e., to
find $V_1(\Lambda(1))$ under the LPUT constraint, we need to obtain
$V_t(\Lambda(t))$ for all $t \in \{2,\ldots,T\}$. As shown in (5), given $\Lambda(t)$
and the SU's OSA actions in slot $t$, two possible belief states
in slot $t+1$ exist with non-zero probability, corresponding to
the two possible observations. Thus, for the case of
$N = 1$, given the initial belief state $\Lambda(1)$ and the SU's OSA
policies $\pi_\delta^*$ and $\pi_c^*$, the complexity of computing $V_1(\Lambda(1))$ is
$O(2^T)$, which does not scale with $T$.

In the following, we focus on designing a suboptimal
policy for (P2-S-2) that can meet the LPUT constraint. We
use the method given in Appendix B to calculate the PU's
normalized throughput, which is similar to the SU's throughput
calculation in (15) and (17). However, also due to the
non-deterministic POMDP belief state transitions, the complexity
of finding a policy that meets the LPUT constraint grows
exponentially over time. Motivated by [16], we
note that the complexity due to the non-deterministic POMDP
belief state transitions is reducible, by converting the POMDP
into an equivalent MDP with deterministic state transitions.
In the following subsections, we first construct the equivalent
MDP, and then, based on the MDP, we propose a suboptimal
policy for (P2-S-2) that satisfies the LPUT constraint in
(P2-S-2).

B. Equivalent MDP with Deterministic State Transitions 

In this subsection, we first convert the POMDP for (P2- 
S-2) into an MDP with deterministic state transitions, and 
reformulate the LPUT constraint in (P2-S-2) based on the 
MDP. We then show that if a policy satisfies the MDP- 
based LPUT constraint, it will also satisfy the POMDP-based 
counterpart in (P2-S-2). 

An MDP in general consists of the following elements [24]:
a set of time slots $\{1,\ldots,T\}$ and, for each time slot, a set of
system states (with transition probabilities), actions, and rewards.
In the following, we formulate the MDP model
for the SU's OSA by specifying these elements according to
[16]. Specifically, for the single-channel case with $n = a$, the
MDP state in slot $t$, $1 \le t \le T$, is denoted by a 4-element
vector $\Omega(t) = \{\omega_{a,i}(t)\}_{i \in \mathcal{C}_s}$, where $\omega_{a,i}(t) \in [0,1]$ is the
conditional probability that the reactive PU is at the $i$-th state
on channel $a$ in slot $t$, given the SU's action history. Note
that the MDP state space, given by $[0,1]^4$, is the same as the
POMDP belief state space, given in Section III-B, for $N = 1$.
We assume that $\Omega(1) = \Lambda(1)$ in the initial slot $t = 1$. Based
on the current MDP state $\Omega(t)$ in slot $t$, the SU selects an
action

$$A_{\mathcal{M}}(t) = \big((\epsilon_{a,\mathcal{M}}(t), \delta_{a,\mathcal{M}}(t)), (f_{a,\mathcal{M}}(0,t), f_{a,\mathcal{M}}(1,t))\big)$$

for OSA, where $(\epsilon_{a,\mathcal{M}}(t), \delta_{a,\mathcal{M}}(t)) \in \mathcal{A}_\delta(a)$ is the sensor
operating point and $(f_{a,\mathcal{M}}(0,t), f_{a,\mathcal{M}}(1,t)) \in [0,1]^2$ is the
channel access probability. Thus, the MDP action space is the
same as the POMDP action space for $N = 1$. We then follow
the optimal OSA policy structure in (25) and set

$$\begin{cases} (\epsilon_{a,\mathcal{M}}(t), \delta_{a,\mathcal{M}}(t)) \text{ is located on the optimal ROC curve,} \\ f_{a,\mathcal{M}}(0,t) = 0, \\ f_{a,\mathcal{M}}(1,t) = 1, \end{cases} \quad \forall t \in \{1,\ldots,T\}, \tag{26}$$

where in slot $t$, the SU only needs to determine the PM action
$\delta_{a,\mathcal{M}}(t)$. Denote $P_a(\Omega(t+1)\,|\,\Omega(t), \delta_{a,\mathcal{M}}(t))$ as the MDP
state transition probability from state $\Omega(t) = \{\omega_{a,i}(t)\}_{i \in \mathcal{C}_s}$
in slot $t$ to state $\Omega(t+1) = \{\omega_{a,i}(t+1)\}_{i \in \mathcal{C}_s}$ in slot $t+1$ on
channel $a$, given the SU's selected PM action $\delta_{a,\mathcal{M}}(t)$ in slot
$t$. From Fig. 3(b), $\omega_{a,i}(t+1) = \sum_{j=0}^{3} \omega_{a,j}(t)\,P_a(i\,|\,j, \delta_{a,\mathcal{M}}(t))$,
$i, j \in \mathcal{C}_s$, where $P_a(i\,|\,j, \delta_{a,\mathcal{M}}(t))$ is the state transition
probability $P_a(i\,|\,j, A)$ given in Fig. 3(b), with $A$ reduced to
$\delta_{a,\mathcal{M}}(t)$ by applying (26). Thus, we obtain the MDP state
transition probability as

$$P_a(\Omega(t+1)\,|\,\Omega(t), \delta_{a,\mathcal{M}}(t)) = \begin{cases} 1, & \text{if } \omega_{a,i}(t+1) = \sum_{j=0}^{3} \omega_{a,j}(t)\,P_a(i\,|\,j, \delta_{a,\mathcal{M}}(t)),\ \forall i \in \mathcal{C}_s, \\ 0, & \text{otherwise.} \end{cases} \tag{27}$$

From (27), the MDP state transition is deterministic. That is,
given $\delta_{a,\mathcal{M}}(t)$ selected at MDP state $\Omega(t)$ in slot $t$, there is
only one possible MDP state $\Omega(t+1)$ in slot $t+1$.

Denote the PU's throughput in slot $t$ on channel $a$ by
$R^{\mathcal{M}}_{p,a}(t)$, which is

$$R^{\mathcal{M}}_{p,a}(t) = (\omega_{a,0}(t) + \omega_{a,2}(t)) \times (1 - \delta_{a,\mathcal{M}}(t)). \tag{28}$$

The PU's normalized throughput on channel $a$ over $T$ slots is
thus given by $\frac{1}{T}\sum_{t=1}^{T} R^{\mathcal{M}}_{p,a}(t)$. With the benchmark throughput
$\Gamma_a$ given in (23), the MDP-based LPUT constraint is
formulated as

$$\frac{1}{T}\sum_{t=1}^{T} R^{\mathcal{M}}_{p,a}(t) = \Gamma_a. \tag{29}$$

Note that due to the deterministic MDP state transitions,
with a complexity analysis similar to that in Section V-A,
it is easy to find that the complexity of computing
$\sum_{t=1}^{T} R^{\mathcal{M}}_{p,a}(t)/T$ under the MDP policy is $O(T)$, which
is substantially reduced compared with that based on the
POMDP.

Proposition 5.3: Given an MDP policy $\pi^{\dagger}_{\delta,\mathcal{M}}$, which specifies
a PM decision $\delta^*_{a,\mathcal{M}}(t)$ in slot $t$, $1 \le t \le T$, we construct
a POMDP policy $\pi^{\dagger}_{\delta}$ for (P2-S-2), where the corresponding
PM decision in each slot $t$ is $\delta_a^*(t) = \delta^*_{a,\mathcal{M}}(t)$. If the
MDP-based LPUT constraint in (29) is satisfied under $\pi^{\dagger}_{\delta,\mathcal{M}}$, the
POMDP-based LPUT constraint in (P2-S-2) is also satisfied
under $\pi^{\dagger}_{\delta}$.

Proof: Please refer to Appendix E. ■

C. Suboptimal Policy 

In this subsection, by studying the MDP-based LPUT
constraint, we derive a suboptimal policy for (P2-S), such that
the LPUT constraint in (P2-S) is satisfied. In the following,
we first give a sufficient condition for satisfying the
MDP-based LPUT constraint, based on which we propose an
MDP-based policy $\pi^{\dagger}_{\delta,\mathcal{M}}$ that satisfies the MDP-based
LPUT constraint. Based on $\pi^{\dagger}_{\delta,\mathcal{M}}$ and Proposition 5.3, we
then obtain a suboptimal policy for (P2-S-2), such that the
LPUT constraint in (P2-S-2) is satisfied. Finally, we obtain the
suboptimal policy for (P2-S) by substituting the suboptimal
policy for (P2-S-2) into the optimal OSA policy structure in
(25).



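1) A sufficient condition for satisfying the MDP-based LPUT
constraint: From (29), to satisfy the LPUT constraint, we take
$\Gamma_a \times T$ as the PU's throughput requirement over all $T$ slots
in the MDP. We then denote $X_{p,a}(t)$ as the PU's throughput
requirement from slot $t$ to slot $T$ in the MDP and have

$$X_{p,a}(1) = \Gamma_a \times T, \quad t = 1, \tag{30}$$

$$X_{p,a}(t) = X_{p,a}(t-1) - R^{\mathcal{M}}_{p,a}(t-1), \quad \forall t \in \{2,\ldots,T\}. \tag{31}$$

Given the PU's obtained throughput $R^{\mathcal{M}}_{p,a}(t-1)$ in slot $t-1$,
from (31) and by calculating backward in time, we observe
that if $X_{p,a}(t)$ in slot $t$ is satisfied, $X_{p,a}(t-1)$ in slot $t-1$
is achieved. Thus, we can easily show that if $X_{p,a}(T)$ in
the last slot $T$ is satisfied, the PU's throughput requirement
$X_{p,a}(1)$ is achieved, i.e., the LPUT constraint given in (29) is
met. Note that $X_{p,a}(T)$ can be satisfied by selecting $\delta_{a,\mathcal{M}}(T)$
such that $X_{p,a}(T) = (\omega_{a,0}(T) + \omega_{a,2}(T)) \times (1 - \delta_{a,\mathcal{M}}(T))$.
Since $\delta_{a,\mathcal{M}}(T) \in [0,1]$, the following inequality is obtained
as a sufficient condition for satisfying (29):

$$0 \le X_{p,a}(T) \le \omega_{a,0}(T) + \omega_{a,2}(T). \tag{32}$$

2) MDP-based policy $\pi^{\dagger}_{\delta,\mathcal{M}}$: The MDP-based policy $\pi^{\dagger}_{\delta,\mathcal{M}}$
is given by the PM actions $\{\delta_{a,\mathcal{M}}(1), \ldots, \delta_{a,\mathcal{M}}(T)\}$. In the
following, we derive the minimum required PM, denoted
by $\delta^{L}_{a,\mathcal{M}}(t)$, and the maximum allowable PM, denoted by
$\delta^{U}_{a,\mathcal{M}}(t)$, in slot $t$, such that (32) is satisfied if the SU selects
$\delta_{a,\mathcal{M}}(t) \in [\delta^{L}_{a,\mathcal{M}}(t), \delta^{U}_{a,\mathcal{M}}(t)]$, $\forall t \in \{1,\ldots,T\}$.

From (30) and (31), to ensure $X_{p,a}(T) \ge 0$, we need
$X_{p,a}(t) \ge 0$ in all the previous slots with $t \in \{1,\ldots,T-1\}$.
By substituting $X_{p,a}(t) \ge 0$ into (31) and using (28), we obtain
$\delta_{a,\mathcal{M}}(t) \ge 1 - \frac{X_{p,a}(t)}{\omega_{a,0}(t)+\omega_{a,2}(t)}$, $1 \le t \le T-1$. When $t = T$, if
$X_{p,a}(T) \le \omega_{a,0}(T) + \omega_{a,2}(T)$ is satisfied, we only need to set
$\delta_{a,\mathcal{M}}(T) = 1 - \frac{X_{p,a}(T)}{\omega_{a,0}(T)+\omega_{a,2}(T)}$ to satisfy $X_{p,a}(T)$. Thus, we
obtain

$$\delta^{L}_{a,\mathcal{M}}(t) = \max\Big(0,\ 1 - \frac{X_{p,a}(t)}{\omega_{a,0}(t)+\omega_{a,2}(t)}\Big), \quad 1 \le t \le T. \tag{33}$$

Proposition 5.4: To ensure $X_{p,a}(T) \le \omega_{a,0}(T) + \omega_{a,2}(T)$,
the SU's PM $\delta_{a,\mathcal{M}}(t)$ selected in slot $t$ needs to satisfy the
following inequality:

$$\delta_{a,\mathcal{M}}(t) \le \frac{\omega_{a,1}(t) \times m_2(t) + \omega_{a,3}(t) \times m_3(t) - X_{p,a}(t)}{(\omega_{a,0}(t)+\omega_{a,2}(t)) \times m_4(t)} + \frac{m_1(t)}{m_4(t)}, \tag{34}$$

where

$$\begin{aligned} m_1(t) &= 1 + (1-\alpha_0^a) \times m_1(t+1) + \alpha_0^a \times m_2(t+1), \\ m_2(t) &= (1-\beta_0^a) \times m_1(t+1) + \beta_0^a \times m_2(t+1), \\ m_3(t) &= (1-\beta_1^a) \times m_1(t+1) + \beta_1^a \times m_3(t+1), \\ m_4(t) &= 1 + (\alpha_1^a-\alpha_0^a) \times m_1(t+1) + \alpha_0^a \times m_2(t+1) - \alpha_1^a \times m_3(t+1), \end{aligned} \tag{35}$$

for $1 \le t < T$ and

$$m_1(T) = 1, \quad m_2(T) = 0, \quad m_3(T) = 0, \quad m_4(T) = 1. \tag{36}$$

Proof: Please refer to Appendix F. ■

With Proposition 5.4, the SU's maximum allowable PM
$\delta^{U}_{a,\mathcal{M}}(t)$ is obtained as

$$\delta^{U}_{a,\mathcal{M}}(t) = \min\Big\{1,\ \frac{\omega_{a,1}(t) \times m_2(t) + \omega_{a,3}(t) \times m_3(t) - X_{p,a}(t)}{(\omega_{a,0}(t)+\omega_{a,2}(t)) \times m_4(t)} + \frac{m_1(t)}{m_4(t)}\Big\}. \tag{37}$$

Clearly, if $\delta_{a,\mathcal{M}}(t) \in [\delta^{L}_{a,\mathcal{M}}(t), \delta^{U}_{a,\mathcal{M}}(t)]$ in slot $t$ is selected,
(32) is guaranteed and thus the MDP-based LPUT constraint
given in (29) is satisfied. We then propose the MDP-based
policy $\pi^{\dagger}_{\delta,\mathcal{M}}$ by specifying

$$\delta^{*}_{a,\mathcal{M}}(t) = \delta^{L}_{a,\mathcal{M}}(t) + \psi(t) \times \big(\delta^{U}_{a,\mathcal{M}}(t) - \delta^{L}_{a,\mathcal{M}}(t)\big), \tag{38}$$

where $\delta^{*}_{a,\mathcal{M}}(t) \in [\delta^{L}_{a,\mathcal{M}}(t), \delta^{U}_{a,\mathcal{M}}(t)]$ is guaranteed by selecting
$\psi(t) \in [0,1]$ in slot $t$.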
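The MDP-based policy can be traced slot by slot. The following is a minimal sketch of (28), (30), (31), (33), and (35)-(38), under two stated assumptions. First, the deterministic state update in `step` is a hypothetical stand-in for the transition probabilities of Fig. 3(b): it is inferred only from the form of the recursion in (35) (states 0 and 2 are the PU-busy states, a collision occurs with probability equal to the PM, and a collision switches the PU to the $\alpha_1^a$, $\beta_1^a$ parameters), and it is not taken from the figure. Second, the initial state is assumed to be the stationary distribution over states 0 and 1. All names (`Chain`, `m_tables`, `pm_bounds`, `step`, `rollout`) are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Chain:
    a0: float  # alpha_0^a
    b0: float  # beta_0^a
    a1: float  # alpha_1^a
    b1: float  # beta_1^a


def m_tables(c: Chain, T: int):
    """Backward recursion (35) with terminal values (36); keys run over slots 1..T."""
    m = {T: (1.0, 0.0, 0.0, 1.0)}
    for t in range(T - 1, 0, -1):
        n1, n2, n3, _ = m[t + 1]
        m[t] = (
            1.0 + (1 - c.a0) * n1 + c.a0 * n2,
            (1 - c.b0) * n1 + c.b0 * n2,
            (1 - c.b1) * n1 + c.b1 * n3,
            1.0 + (c.a1 - c.a0) * n1 + c.a0 * n2 - c.a1 * n3,
        )
    return m


def pm_bounds(w, x, mt):
    """Minimum required PM (33) and maximum allowable PM (37)."""
    m1, m2, m3, m4 = mt
    busy = w[0] + w[2]
    lo = max(0.0, 1.0 - x / busy)
    hi = min(1.0, (w[1] * m2 + w[3] * m3 - x) / (busy * m4) + m1 / m4)
    return lo, hi


def step(w, d, c: Chain):
    """ASSUMED deterministic state update (stand-in for Fig. 3(b), see lead-in)."""
    busy = w[0] + w[2]
    ok, hit = busy * (1 - d), busy * d  # PU busy without / with a collision
    return (
        ok * (1 - c.a0) + w[1] * (1 - c.b0) + w[3] * (1 - c.b1),  # state 0
        ok * c.a0 + w[1] * c.b0,                                  # state 1
        hit * (1 - c.a1),                                         # state 2
        hit * c.a1 + w[3] * c.b1,                                 # state 3
    )


def rollout(c: Chain, w1, gamma: float, T: int, psi: float = 0.8):
    """Apply (38) in every slot; return the PM sequence and normalized PU throughput."""
    m, w, x = m_tables(c, T), tuple(w1), gamma * T  # x starts at (30)
    pms, total = [], 0.0
    for t in range(1, T + 1):
        lo, hi = pm_bounds(w, x, m[t])
        d = lo + psi * (hi - lo)          # (38)
        r = (w[0] + w[2]) * (1 - d)       # (28)
        pms.append(d)
        total += r
        x -= r                            # (31)
        w = step(w, d, c)
    return pms, total / T


if __name__ == "__main__":
    chain = Chain(a0=0.1, b0=0.2, a1=0.9, b1=0.95)
    p_busy = (1 - chain.b0) / (1 + chain.a0 - chain.b0)
    pms, avg = rollout(chain, (p_busy, 1 - p_busy, 0.0, 0.0), gamma=0.8, T=5)
    print(pms, avg)
```

Under the assumed transitions, the interval in the last slot collapses to a single point, so the normalized PU throughput returned by `rollout` meets the benchmark with equality, which is the behavior (29) asks for. Only the sum $\omega_{a,0}(t) + \omega_{a,2}(t)$ enters the bounds, so the choice of which busy state the PU returns to from state 3 does not affect the result.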

3) POMDP-based suboptimal policy: We first consider the
POMDP-based problem (P2-S-2). From Proposition 5.3, by
setting $\delta_a^*(t) = \delta^{*}_{a,\mathcal{M}}(t)$, $1 \le t \le T$, we find a suboptimal
policy $\pi^{\dagger}_{\delta}$ for (P2-S-2), which satisfies the LPUT constraint
in (P2-S-2).

Next, we consider the original POMDP-based problem (P2-S).
Note that (P2-S-2) is reduced from (P2-S-1) without loss
of optimality and (P2-S-1) is equivalent to (P2-S). Thus, by
substituting the suboptimal spectrum sensor operating policy
$\pi^{\dagger}_{\delta}$ into the optimal OSA policy structure in (25), we obtain
a suboptimal OSA policy for both (P2-S-1) and (P2-S) as

$$\begin{cases} \delta_a^*(t) = \delta^{L}_{a,\mathcal{M}}(t) + \psi(t) \times \big(\delta^{U}_{a,\mathcal{M}}(t) - \delta^{L}_{a,\mathcal{M}}(t)\big), \text{ where } \psi(t) \in [0,1], \\ \epsilon_a^*(t) \text{ is on the best ROC curve corresponding to } \delta_a^*(t), \\ f_a^*(0,t) = 0, \\ f_a^*(1,t) = 1. \end{cases} \tag{39}$$

Since $\pi^{\dagger}_{\delta}$ satisfies the LPUT constraint in (P2-S-2), the
suboptimal OSA policy in (39) satisfies the LPUT constraint in
(P2-S-1) and thus (P2-S) with equality.

D. Multi-Channel Case 

Finally, we consider the general case with $N > 1$ for (P2).
In this case, although the spectrum sensing policy $\pi_s$ generally
depends on the sensor operating policy $\pi_\delta$ and the spectrum
access policy $\pi_c$, a suboptimal policy for (P2) can be obtained
by designing $\pi_s$ separately from $\pi_\delta$ and $\pi_c$, i.e., by applying
a separation principle. Based on the results for $N = 1$, a
suboptimal policy for $N > 1$ is proposed as follows.

• Step 1: On each channel $n \in \mathcal{A}_s$, the SU selects the sensor
operating point and the spectrum access probabilities
in slot $t$ according to $\pi^{\dagger}_{\delta}$ and $\pi_c^*$, as given in (39).

• Step 2: Apply $\pi^{\dagger}_{\delta}$ and $\pi_c^*$ to obtain the SU's sensing
policy $\pi_s^*$, which determines the channel to be sensed in
slot $t$, by solving the following unconstrained POMDP:

$$\pi_s^* = \arg\max_{\pi_s} E_{\pi_s}\Big\{\sum_{t=1}^{T} R_s(t)\,\Big|\,\Lambda(1), \pi^{\dagger}_{\delta}, \pi_c^*\Big\}. \tag{40}$$

The sensor operating policy $\pi^{\dagger}_{\delta}$ and the spectrum access
policy $\pi_c^*$ in (39) are designed such that the PU's normalized
throughput on channel $n$, $n \in \mathcal{A}_s$, equals the benchmark
throughput $\Gamma_n$ if the SU always selects this channel to sense.
When $N > 1$, the SU has the option to select one of the $N$
channels to sense. Thus, the PU's normalized throughput on
channel $n$ will be at least $\Gamma_n$. Hence, the LPUT constraint in
(P2) over each channel is satisfied by the proposed policy.

VI. Numerical Results 

In this section, we show by simulation the SU's and PU's
throughput under the proposed reactive PU model. We assume
an energy-detection-based spectrum sensor for the SU, where
the background noise and the received PU signal are modeled
as independent white Gaussian processes. Let $M$ be the
number of PU signal measurements, and $\eta_n$ be the decision
threshold for channel $n \in \mathcal{A}_s$. Let $\kappa_{n,0}$ and $\kappa_{n,1}$ denote the
power of the noise and the received PU signal on channel $n$,
respectively. Under the Neyman-Pearson (NP) criterion, the
PFA and PM in slot $t \in \{1,\ldots,T\}$ are expressed in terms of
the incomplete gamma function
$\gamma(m,a) = \frac{1}{\Gamma(m)}\int_0^a t^{m-1}e^{-t}\,dt$ [25]. The optimal decision threshold $\eta_n^*(t)$
in slot $t$ of the energy detector is chosen such that $\delta_n(t) = \zeta$
if the SCCP constraint is adopted, or $\delta_n(t) = \delta_n^*(t)$ if the
LPUT constraint is adopted. Furthermore, we set $\kappa_{n,0} = 0$
dB, $\kappa_{n,1} = 5$ dB, $\forall n \in \mathcal{A}_s$, and $M = 30$.
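Choosing the threshold so that the PM hits a target value can be done by a one-dimensional search. The following is a minimal sketch under a loudly stated assumption: the PFA and PM expressions are not legible in this copy, so the code uses a common energy-detector form, PFA $= 1 - \gamma(M, \eta/\kappa_0)$ and PM $= \gamma(M, \eta/(\kappa_0+\kappa_1))$, whose threshold scaling may differ from the paper's. The names `reg_lower_gamma`, `pfa_pm`, and `threshold_for_pm` are illustrative.

```python
import math


def reg_lower_gamma(m: int, a: float) -> float:
    """gamma(m, a) = (1/Gamma(m)) * int_0^a t^(m-1) e^(-t) dt for integer m >= 1.

    Uses the closed form 1 - exp(-a) * sum_{k=0}^{m-1} a^k / k!.
    """
    term, series = 1.0, 1.0
    for k in range(1, m):
        term *= a / k
        series += term
    return 1.0 - math.exp(-a) * series


def pfa_pm(eta: float, m: int, noise: float, signal: float):
    """ASSUMED detector model (see lead-in): returns (PFA, PM) for threshold eta."""
    pfa = 1.0 - reg_lower_gamma(m, eta / noise)
    pm = reg_lower_gamma(m, eta / (noise + signal))
    return pfa, pm


def threshold_for_pm(target_pm: float, m: int, noise: float, signal: float,
                     tol: float = 1e-10) -> float:
    """Bisection on eta; PM is nondecreasing in eta, so the search is well defined."""
    lo, hi = 0.0, 1.0
    while pfa_pm(hi, m, noise, signal)[1] < target_pm:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if pfa_pm(mid, m, noise, signal)[1] < target_pm:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    noise = 10 ** (0 / 10)    # 0 dB in linear scale
    signal = 10 ** (5 / 10)   # 5 dB in linear scale
    eta = threshold_for_pm(0.05, 30, noise, signal)
    print(eta, pfa_pm(eta, 30, noise, signal))
```

The same search applies under either constraint: the target is $\zeta$ under the SCCP constraint and the slot-dependent $\delta_n^*(t)$ under the LPUT constraint. The resulting PFA then follows from the ROC curve implied by the assumed detector model.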



A. Single-Channel Case 

We first study the PU's and the SU's performance in the
single-channel case. We set $\alpha_0^a = 0.1$, $\beta_0^a = 0.2$, $\alpha_1^a = 0.9$,
and $\beta_1^a = 0.95$. According to the stationary distribution of
the underlying Markov chain, the PU's initial channel access
probability in slot $t = 1$ is obtained as
$\frac{1-\beta_0^a}{1+\alpha_0^a-\beta_0^a} = 0.889$.
We consider two cases: $\zeta = 0.05$ and $\zeta = 0.1$. From (23),
we obtain the PU's benchmark throughput $\Gamma_a = 0.846$ when
$\zeta = 0.05$, and $\Gamma_a = 0.8$ when $\zeta = 0.1$. In each case, we
first show the SU's normalized throughput under the
non-reactive PU model and the reactive PU model, both subject to
the SCCP constraint. The non-reactive PU model is obtained
equivalently by setting $\alpha_1^a = \alpha_0^a = 0.1$ and $\beta_1^a = \beta_0^a = 0.2$
in the reactive PU model. The SU's normalized throughput
under the non-reactive PU model is computed based on the
SU's optimal OSA policy in [9]. We then focus on the reactive
PU model and compare the PU's and the SU's normalized
throughput under the SCCP constraint and the LPUT constraint
in both cases of $\zeta = 0.05$ and $\zeta = 0.1$. Specifically,
under the SCCP constraint, the SU adopts the optimal sensor
operating policy $\pi_\delta^*$ and the optimal spectrum access policy
$\pi_c^*$, as shown in Section IV, while under the LPUT constraint,
the SU adopts the suboptimal sensor operating policy $\pi^{\dagger}_{\delta}$ with
$\psi(t) = 0.8$ and the optimal access policy $\pi_c^*$, as shown in
Section V.

Fig. 4 shows the SU's normalized throughput under the
non-reactive PU model as well as the reactive PU model,
by adopting the SCCP constraint. We observe that the SU
achieves higher throughput in the latter than in the former
model. This is mainly because the reactive PU reduces its
channel access probability after a collision with the SU,
whereas the non-reactive PU keeps it unchanged. Since the non-reactive
PU has the same channel access probability, when $N = 1$,
the expected channel access opportunities are unchanged over
time for the SU. As a result, the SU's throughput is constant
under the non-reactive PU model. In addition, we observe that
the SU's throughput under the non-reactive PU model remains
the same in both cases of $\zeta = 0.05$ and $\zeta = 0.1$. However,
the SU's throughput under the reactive PU model is higher in
the case of $\zeta = 0.1$ than in the case of $\zeta = 0.05$ due to
the more relaxed SCCP constraint.

Fig. 4. SU's normalized throughput under the SCCP constraint. N = 1.
Fig. 5. PU's normalized throughput. N = 1.

Fig. 5 compares the PU's normalized throughput with
the benchmark throughput. According to Section IV-B, the
benchmark throughput, which remains constant over
time in both cases of $\zeta = 0.05$ and $\zeta = 0.1$, is actually
the non-reactive PU's normalized throughput under the SCCP
constraint in the single-channel case. Thus, the non-reactive
PU is effectively protected by the SCCP constraint. However,
under the reactive PU model, it is observed that if the SU
adopts the SCCP constraint, the PU's normalized throughput
is lower than the benchmark throughput $\Gamma_a$ in both cases of
$\zeta = 0.05$ and $\zeta = 0.1$ if $T > 1$, and thus the PU is not
protected properly, which is in accordance with Proposition
4.1. Furthermore, the PU's throughput loss is more substantial
in the case of $\zeta = 0.1$ than in the case of $\zeta = 0.05$, since
the PU allows more collisions when $\zeta$ is larger. On the other
hand, if the SU adopts the LPUT constraint, we observe that
the PU's normalized throughput is equal to the benchmark
throughput in both cases, and thus the PU is protected as
expected.

Fig. 6. SU's normalized throughput under the reactive PU model. N = 1.
Fig. 7. Sum throughput under the reactive PU model. N = 1.

Fig. 6 shows the corresponding SU's normalized throughput
under the reactive PU model. We observe that the SCCP
constraint leads to a higher SU throughput than the LPUT
constraint in both cases of $\zeta = 0.05$ and $\zeta = 0.1$. This is
because the SU can exploit the PU's reaction to get more
throughput under the SCCP constraint. It is also observed from
Fig. 6 that, unlike the case under the SCCP constraint, the
SU's throughput under the LPUT constraint does not always
increase over $T$. For example, the SU's throughput obtained
with $T = 5$ is lower than that with $T = 4$. In addition, by
comparing the SU's normalized throughput in both cases, we
observe that the SU achieves higher throughput with $\zeta = 0.1$
as compared to $\zeta = 0.05$. This is consistent with the PU's
higher throughput loss when $\zeta = 0.1$, as shown in Fig. 5.

To evaluate the performance of the proposed suboptimal OSA
policy for the SU, in the following, we propose an upper
bound of the SU's normalized throughput for the single-channel
case, under the constraint that the PU must achieve the
benchmark throughput. Since unit throughput
is the maximum normalized throughput that a channel can
provide, the upper bound is given by the difference between
the unit throughput and the PU's benchmark throughput on
the channel. Since the SU's throughput loss due to sensing
errors is not considered, the upper bound is higher than the
SU's maximum normalized throughput when the PU achieves
the benchmark throughput. As shown in Fig. 6, we compare
the SU's normalized throughput with the upper bound in both
cases of $\zeta = 0.05$ and $\zeta = 0.1$ under both the SCCP and
LPUT constraints. It is observed that the SU's normalized
throughput under the suboptimal policy is always lower than
the upper bound under the LPUT constraint. However, under
the SCCP constraint, after $T = 3$ in both cases of $\zeta = 0.05$
and $\zeta = 0.1$, the SU's normalized throughput becomes higher
than the upper bound, since the PU's achievable throughput
deviates from the benchmark throughput.

Fig. 7 compares the sum of the SU's and PU's normalized
throughput under the SCCP and LPUT constraints in both
cases of $\zeta = 0.05$ and $\zeta = 0.1$, where the reactive PU model
is considered. It is observed from Fig. 7 that, in both cases,
the LPUT constraint leads to a higher sum-throughput than the
SCCP constraint. It is worth pointing out that, although the
sum-throughputs under the SCCP and LPUT constraints are
close, the individual portions of the SU's and PU's throughput,
as shown in Fig. 5 and Fig. 6, are very different under these
two constraints.

B. Multi-Channel Case 

Next, we consider the multi-channel case by assuming
$N = 3$. Since the SU's and PU's throughput performance
under the non-reactive PU model in the multi-channel
case is similar to that in the single-channel case,
we only consider the reactive PU model in this
case. The reactive PU model is given by four vectors $\alpha_0 =
(0.1, 0.1, 0.05)$, $\beta_0 = (0.1, 0.2, 0.6)$, $\alpha_1 = (0.9, 0.9, 0.9)$,
and $\beta_1 = (0.95, 0.95, 0.95)$. According to the stationary
distribution of the underlying Markov chain, the PU's initial
channel access probabilities in slot $t = 1$ are obtained as
$(0.9, 0.889, 0.889)$. Since the results of the SU's and the PU's
throughput with $\zeta = 0.1$ are similar to those with $\zeta = 0.05$,
we only show the case of $\zeta = 0.05$ under the reactive PU
model. From (23), the PU's benchmark throughput is obtained
as $\Gamma = (\Gamma_1, \Gamma_2, \Gamma_3) = (0.855, 0.846, 0.846)$ with $\zeta = 0.05$.
Similar to the single-channel case, the SU adopts $\pi^{\dagger}_{\delta}$ with
$\psi(t) = 0.8$.

Fig. 8 shows the PU's normalized throughput under the
SCCP constraint. It is observed that the PU's normalized
throughput on channel 1 is higher than the benchmark
throughput $\Gamma_1$ and remains at 0.9 over $T$, which is the PU's
throughput in the absence of the SU, while the PU's normalized
throughput on channel 2 and channel 3 varies over $T$. This
indicates that the SU only selects channel 2 and channel 3
to access in this example, since the SU is able to achieve
more reward on these two channels, where the PUs have
lower initial channel access probabilities than on channel 1.
We also observe that the PU's throughput on channel 2 is
always larger than its benchmark throughput $\Gamma_2$. However,
when $T > 2$, the PU's throughput on channel 3 decreases
over $T$ and becomes lower than the benchmark throughput
$\Gamma_3$. This indicates that the SU selects channel 3 to access in
most of the slots. Thus, the PU on channel 2 is protected as
expected, while the PU on channel 3 is not protected properly.
Hence, as we discussed in Section IV, the SCCP constraint is
in general not able to provide effective protection to all the
reactive PUs when $N > 1$.

Fig. 8. PU's normalized throughput under the SCCP constraint. N = 3.
Fig. 9. PU's normalized throughput under the LPUT constraint. N = 3.
Fig. 10. SU's normalized throughput under the LPUT constraint. N = 3.
Fig. 11. Sum throughput under the reactive PU model. N = 3.

Fig. 9 shows the PU's normalized throughput under the LPUT constraint. Similar to Fig. 8 under the SCCP constraint, the PUs on channels 1 and 2 are both properly protected by the LPUT constraint, since their normalized throughputs are larger than their respective benchmark throughputs $\Gamma_1$ and $\Gamma_2$. Different from Fig. 8, where the PU's normalized throughput on channel 3 is not guaranteed to meet the benchmark throughput $\Gamma_3$, we observe from Fig. 9 that the throughput under the LPUT constraint is higher than $\Gamma_3$, i.e., the PU on channel 3 is protected properly. This shows that the LPUT constraint provides effective protection to all the reactive PUs.

Fig. 10 compares the SU's normalized throughput under the SCCP and LPUT constraints. Similar to the single-channel case shown in Fig. 6, the SU achieves higher throughput under the SCCP constraint than under the LPUT constraint. Compared with the single-channel case of $\zeta = 0.05$ in Fig. 6, we observe that the SU achieves higher throughput in the multi-channel case under both the SCCP and LPUT constraints. This is because when $N > 1$, the SU has more flexibility in selecting for access the channels that are more likely to be unoccupied.

Fig. 11 compares the sum of the SU's and the PU's normalized throughput on each channel under the SCCP and LPUT constraints. It is already shown in Fig. 8 (SCCP constraint) and Fig. 9 (LPUT constraint) that the SU only accesses channel 2 or channel 3 and does not access channel 1. Therefore, we observe from Fig. 11 that the sum-throughput on channel 1 under both constraints remains unchanged over time and is equal to the PU's normalized throughput on channel 1. It is also observed from Fig. 11 that, on channel 2 and channel 3, the LPUT constraint leads to a higher sum-throughput than the SCCP constraint, which is similar to the single-channel case, as shown in Fig. 7.

VII. Conclusion

In this paper, we studied a practical multi-channel CR net- 
work overlaid with reactive PUs. We proposed a new channel 
access model for the reactive PU, in which the probability for 
the PU to access a particular channel is related to the SU's past 
access decisions. Under this model, we formulated the optimal 
OSA design for the SU's throughput maximization as a 
constrained POMDP problem. We considered both SCCP and 
LPUT constraints to protect the reactive PU's transmission. 
For the SCCP constraint, we developed the optimal OSA 
policy via a separation principle. For the LPUT constraint, 
we developed the structure of the optimal OSA policy. In 
order to reduce the computational complexity, we converted 
the POMDP into an equivalent MDP with deterministic state 
transitions. With the reformulated LPUT constraint, we proposed
a suboptimal policy of lower complexity. It is shown
that the proposed policy guarantees the PU's throughput in both
the single-channel and multi-channel cases.

Appendix A
Proof of Theorem 4.1



First, we present the following lemma.

Lemma A.1: $(\epsilon_a^*(t), \delta_a^*(t))$ and $(f_a^*(0,t), f_a^*(1,t))$ given in (19) are the optimal solutions to the problem

$$\max_{\substack{(\epsilon_a(t), \delta_a(t)) \in \mathbb{A}_\delta(a(t)), \\ (f_a(0,t), f_a(1,t)) \in [0,1]^2}} C_1 \times \mu_a(t) + C_2 \times g_a(t) + C_3 \quad \text{s.t.} \quad \mu_a(t) \le \zeta,$$

where $C_1$, $C_2$, and $C_3$ are constants with $C_1 \ge 0$, $C_2 \ge 0$, $g_a(t)$ is given in (4), and $\mu_a(t)$ is given in (7).

Proof: In [9], the authors proved that (19) is the optimal solution to the problem

$$\max_{\substack{(\epsilon_a(t), \delta_a(t)) \in \mathbb{A}_\delta(a(t)), \\ (f_a(0,t), f_a(1,t)) \in [0,1]^2}} g_a(t) \quad \text{s.t.} \quad \mu_a(t) \le \zeta.$$

By applying (19), the constraint function $\mu_a(t)$ achieves $\zeta$ with equality. Since $C_1 \ge 0$ and $C_2 \ge 0$, the objective function is maximized when the solution is given by (19). Lemma A.1 thus follows. ■
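
The statement of Lemma A.1 can be illustrated numerically with a brute-force search. The sketch below is not the authors' method and relies on several assumptions: a toy concave ROC curve with detection probability $1 - \delta = \epsilon^{\gamma}$, $0 < \gamma < 1$ (hypothetical), the reward $g = \epsilon f(0) + (1-\epsilon) f(1)$, the collision probability $\mu = (1-\delta) f(0) + \delta f(1)$, and a search restricted to operating points on the ROC curve. The function name and all numerical values are illustrative only.

```python
def best_action_on_grid(zeta, gamma, c1=0.5, c2=1.0, c3=0.0, n_eps=200, n_f=21):
    """Grid-search C1*mu + C2*g + C3 subject to mu <= zeta on a toy ROC curve.

    Assumptions (not from the paper): 1 - delta = eps**gamma with 0 < gamma < 1,
    g = eps*f0 + (1-eps)*f1, and mu = (1-delta)*f0 + delta*f1.
    """
    # Operating point on the toy ROC curve whose miss probability equals zeta.
    eps_star = (1.0 - zeta) ** (1.0 / gamma)
    eps_grid = [i / n_eps for i in range(1, n_eps)] + [eps_star]
    f_grid = [i / (n_f - 1) for i in range(n_f)]
    best = None
    for eps in eps_grid:
        delta = 1.0 - eps ** gamma
        for f0 in f_grid:
            for f1 in f_grid:
                mu = (1.0 - delta) * f0 + delta * f1
                if mu > zeta + 1e-12:  # collision constraint violated
                    continue
                g = eps * f0 + (1.0 - eps) * f1
                value = c1 * mu + c2 * g + c3
                if best is None or value > best[0]:
                    best = (value, eps, delta, f0, f1)
    return best, eps_star


best, eps_star = best_action_on_grid(zeta=0.05, gamma=0.3)
value, eps, delta, f0, f1 = best
print(eps == eps_star, round(delta, 6), f0, f1)
```

On this toy model the search selects the operating point with $\delta = \zeta$ together with $f(0) = 0$ and $f(1) = 1$, which is the structure the lemma describes: the collision budget is spent entirely on accessing after an "idle" sensing outcome.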

With Lemma A.1, we now prove the separation principle. By mathematical induction, we prove in the following that (19) is the SU's optimal action decision for (P1) in each slot $t$, $1 \le t \le T$, and that, with (19), the value function $V_t(\Lambda(t))$ in slot $t$ is of the following form:

$$V_t(\Lambda(t)) = D_t \times (\lambda_{a0}(t) + \lambda_{a2}(t)) + F_t \times \lambda_{a1}(t) + H_t \times \lambda_{a3}(t) + z_t, \quad 1 \le t \le T, \tag{41}$$

where $D_t \ge 0$, $H_t \ge F_t \ge 0$, and $z_t$ is independent of $\Lambda(t)$. Specifically, in the last slot $t = T$, suppose that channel $a$ is selected to sense. From (17), we then have $Q_T(\Lambda(T)|A) = (\lambda_{a1}(T) + \lambda_{a3}(T)) \times g_a(T)$. Since $\lambda_{a1}(T) + \lambda_{a3}(T) \ge 0$, from Lemma A.1, it is easy to verify that (19) is the SU's optimal action decision in slot $T$, such that $Q_T(\Lambda(T)|A(T))$ is maximized subject to the SCCP constraint. Applying (19) to $Q_T(\Lambda(T)|A)$, we obtain $V_T(\Lambda(T)) = (\lambda_{a1}(T) + \lambda_{a3}(T)) \times (1 - \epsilon_a^*(T))$, which follows the form given in (41).



Now suppose that in slot $t$, channel $a$ is selected to sense, and (19) is the SU's optimal action decision in slot $t$ with value function $V_t(\Lambda(t))$ given by (41). Next, supposing $a(t-1) = n$ in slot $t-1$, given $\Lambda(t-1)$, we derive the optimal action decision in slot $t-1$ and the value function $V_{t-1}(\Lambda(t-1))$ in the following two cases:

• Case 1: $n \ne a$. According to (8) and ( fTSl l and after some algebra, we obtain $Q_{t-1}(\Lambda(t-1)|A(t-1)) = (\epsilon_n(t-1) f_n(0,t-1) + (1 - \epsilon_n(t-1)) f_n(1,t-1)) \times (\lambda_{n1}(t-1) + \lambda_{n3}(t-1)) + V_t(\Lambda(t))$. Since from (8), $\Lambda(t)$ is independent of the SU's action $A(t-1)$, $V_t(\Lambda(t))$ is treated as a constant. Then according to Lemma A.1, (19) is the SU's optimal action to maximize $Q_{t-1}(\Lambda(t-1)|A(t-1))$ subject to the SCCP constraint. Applying (19) to $Q_{t-1}(\Lambda(t-1)|A(t-1))$ yields $V_{t-1}(\Lambda(t-1)) = (1 - \epsilon_n^*(t-1)) \times (\lambda_{n1}(t-1) + \lambda_{n3}(t-1)) + V_t(\Lambda(t))$. Clearly, $V_{t-1}(\Lambda(t-1))$ follows the form given in (41).

• Case 2: $n = a$. Similarly to Case 1, according to (8) and ( [TSI l, we can obtain the expression of $Q_{t-1}(\Lambda(t-1)|A(t-1))$, in which $V_t(\Lambda(t))$ is related to the SU's action $A(t-1)$. By adopting the same method as in Case 1, it is easy to verify that (19) is the SU's optimal action and the resultant $V_{t-1}(\Lambda(t-1))$ follows the form given in (41), with $D_{t-1} = \zeta \times [H_t \alpha_a^1 - D_t \alpha_a^1 - F_t \alpha_a^0 + D_t \alpha_a^0] + F_t \alpha_a^0 + D_t (1 - \alpha_a^0) \ge 0$, $F_{t-1} = F_t \beta_a^0 + D_t - D_t \beta_a^0 + 1 - \epsilon_a^*(t-1) \ge 0$, $H_{t-1} = H_t \beta_a^1 + D_t - D_t \beta_a^1 + 1 - \epsilon_a^*(t-1) \ge 0$, and $z_{t-1} = z_t$. Obviously, $H_{t-1} \ge F_{t-1}$.
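
The Case 2 recursion can be iterated backward from the terminal slot to see how the coefficients evolve. The following is a minimal sketch under stated assumptions: the same channel is sensed in every slot, the optimal false-alarm probability is a constant `eps_star` (a hypothetical value of 0.2 is used), the terminal coefficients are $D_T = 0$ and $F_T = H_T = 1 - \epsilon^*$ as implied by the expression for $V_T$, and the recursion has the form $D_{t-1} = \zeta [H_t \alpha^1 - D_t \alpha^1 - F_t \alpha^0 + D_t \alpha^0] + F_t \alpha^0 + D_t (1 - \alpha^0)$, $F_{t-1} = F_t \beta^0 + D_t - D_t \beta^0 + 1 - \epsilon^*$, $H_{t-1} = H_t \beta^1 + D_t - D_t \beta^1 + 1 - \epsilon^*$, which is a reading of the garbled source. The channel parameters are those quoted for channel 3 in the simulation section.

```python
def value_coefficients(alpha0, beta0, alpha1, beta1, zeta, eps_star, horizon):
    """Backward recursion for the value-function coefficients (D_t, F_t, H_t).

    Returns a list whose entry k holds the coefficients of slot k + 1,
    so the last entry corresponds to the terminal slot T = horizon.
    """
    reward = 1.0 - eps_star
    coeffs = [(0.0, reward, reward)]  # slot T: D_T = 0, F_T = H_T = 1 - eps*
    for _ in range(horizon - 1):
        d, f, h = coeffs[0]
        d_prev = (zeta * (h * alpha1 - d * alpha1 - f * alpha0 + d * alpha0)
                  + f * alpha0 + d * (1.0 - alpha0))
        f_prev = f * beta0 + d - d * beta0 + reward
        h_prev = h * beta1 + d - d * beta1 + reward
        coeffs.insert(0, (d_prev, f_prev, h_prev))
    return coeffs


# Channel-3 parameters from the simulation setup; eps_star is hypothetical.
coeffs = value_coefficients(alpha0=0.05, beta0=0.6, alpha1=0.9, beta1=0.95,
                            zeta=0.05, eps_star=0.2, horizon=10)
for slot, (d, f, h) in enumerate(coeffs, start=1):
    print(slot, round(d, 4), round(f, 4), round(h, 4))
```

For these values the ordering $H_t \ge F_t \ge 0$ and $D_t \ge 0$ holds in every slot, consistent with the conditions attached to (41).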

Hence, (19) is the SU's optimal action in each slot under the SCCP constraint. From (19), the optimal sensor operating point and the optimal access probabilities are constant and independent of the channel selected to sense. As a result, we can separately design the optimal spectrum sensing policy as shown in Theorem 4.1 without loss of optimality. Theorem 4.1 is thus proved.

Appendix B
Proof of Proposition 4.1

For the case of $N = 1$ with $n = a$, there is only one PU and SU pair sharing one channel and the SU's policies are reduced to $\pi_\delta$ and $\pi_c$. Given the current belief state $\Lambda(t)$, we use $G_t(\Lambda(t)|\pi)$ to denote the PU's throughput on channel $a$ from slot $t$ to slot $T$ under the SU's policy $\pi = \{\pi_\delta, \pi_c\}$. Similar to $Q_t(\Lambda(t)|A)$ for the SU in (16) and (17), from dH) and (22) and with the fact that $\sigma_a(t) = \mu_a(t)$, $t \in [0,1]$, for $N = 1$, we have

$$G_t(\Lambda(t)|\pi) = (\lambda_{a0}(t) + \lambda_{a2}(t))(1 - \mu_a(t)) + \sum_{i=0}^{3} \sum_{k=0}^{1} \lambda_{ai}(t)\, U_A(k|i)\, G_{t+1}(\Lambda(t+1)|\pi), \quad 1 \le t \le T-1, \tag{42}$$

$$G_T(\Lambda(T)|\pi) = (\lambda_{a0}(T) + \lambda_{a2}(T))(1 - \mu_a(T)), \quad t = T. \tag{43}$$

It is easy to find that

$$G_1(\Lambda(1)|\pi) = E_\pi\Big\{\sum_{t=1}^{T} R_{P,a}(t) \,\Big|\, \Lambda(1)\Big\},$$

which is the PU's throughput on channel $a$ over all $T$ slots. Thus, $R^{o}_{P,a} = G_1(\Lambda(1)|\pi)/T$.

From (42) and (43), to find $G_1(\Lambda(1)|\pi)$, we need to compute $G_t(\Lambda(t)|\pi)$ for all $t \in \{1, \ldots, T\}$. We consider the PU's throughput from slot $t$ to slot $T$ under two cases. One is with the SU's optimal policy for (P1) under the reactive PU model. The other is with the SU's optimal policy in [9], which is under the SCCP constraint and under the non-reactive PU model. For notational convenience, we denote the PU's throughput from slot $t$ to slot $T$ obtained in the former case by $G_t$, and that in the latter case by $G'_t$. Note that $G'_1/T = \Gamma_a$. We take $G'_t$ as a reference and show that $G_1 \le G'_1$. Since the proof is similar to that in Appendix A, in the following, we only provide the proof sketch.

Based on mathematical induction and by computing backward in time from (42) and (43), it is easy to find that in slot $t$, $\forall t \in \{1, \ldots, T\}$ and $T \ge 1$, $G_t$ is of the following form:

$$G_t = (\lambda_{a0}(t) + \lambda_{a2}(t)) \times q_t + \lambda_{a1}(t) \times w_t + \lambda_{a3}(t) \times m_t, \tag{44}$$

where the coefficients $q_t$, $w_t$, and $m_t$ are time-varying and depend on $(\alpha_a^0, \beta_a^0, \alpha_a^1, \beta_a^1)$. Since the SU's optimal policies under the two cases are the same, given $\Lambda(1)$, the belief states $\Lambda(t)$ under the two cases in each slot are also the same. Moreover, given $\Lambda(t)$, $G'_t$ has a similar expression to $G_t$ in slot $t$, but with different coefficients, which are denoted by $q'_t$, $w'_t$, and $m'_t$. Note that by reducing $\alpha_a^1$ and $\beta_a^1$ to $\alpha_a^1 = \alpha_a^0$ and $\beta_a^1 = \beta_a^0$, the reactive PU model is reduced to the non-reactive counterpart. Thus, by doing so, in each slot, the coefficients of $G_t$, i.e., $q_t$, $w_t$, and $m_t$, are reduced to those of $G'_t$, i.e., $q'_t$, $w'_t$, and $m'_t$, respectively. Based on mathematical induction, we find that in slot $t$, $q_t \le q'_t$, $w_t \le w'_t$, and $m_t \le m'_t$. Thus, we obtain $G_t \le G'_t$, $\forall t \in \{1, \ldots, T\}$.

Hence, we have $G_1/T \le G'_1/T$, i.e., for (P1) and under the SU's optimal policy, the PU's normalized throughput $R^{o}_{P,a} \le \Gamma_a$. Proposition 4.1 is thus proved.



Appendix C
Proof of Proposition 5.1

We first study the PU's throughput under the policy $\pi$ and construct policy $\pi'$ based on $\pi$. Suppose under the policy $\pi$, the PU's throughput from slot $t = 1$ to slot $t = T - 1$ is $E_\pi\{\sum_{t=1}^{T-1} R_{P,a}(t)|\Lambda(1)\} = \rho_a$, and the PU's throughput in the last slot $t = T$ is $u_T$. From jTH, we have

$$(\lambda_{a0}(T) + \lambda_{a2}(T)) \times (1 - \mu_a(T)) = u_T, \tag{45}$$

where $\mu_a(T)$ is given in (7). Note that under policy $\pi$, $E_\pi\{\sum_{t=1}^{T} R_{P,a}(t)|\Lambda(1)\} \ge \Gamma_a \times T$. Thus, $u_T \ge \Gamma_a \times T - \rho_a$.

The policy $\pi'$ is constructed based on $\pi$. We let the decision functions, as described in Section III-C, from slot $t = 1$ to slot $t = T - 1$ of policy $\pi'$ be the same as those of policy $\pi$. We thus have $E_{\pi'}\{\sum_{t=1}^{T-1} R_{P,a}(t)|\Lambda(1)\} = \rho_a$. Different from policy $\pi$, under the policy $\pi'$ in the last slot $t = T$, we let $E_{\pi'}\{R_{P,a}(T)|\Lambda(T)\} = \Gamma_a \times T - \rho_a$ by selecting the following actions: $\delta'_a(T) = 1 - \frac{\Gamma_a \times T - \rho_a}{\lambda_{a0}(T) + \lambda_{a2}(T)}$, $\epsilon'_a(T)$ is the one on the optimal ROC curve corresponding to $\delta'_a(T)$, $f'_a(0,T) = 0$, and $f'_a(1,T) = 1$. Note that since $u_T \ge \Gamma_a \times T - \rho_a$, from (45), it is clear that $0 \le \delta'_a(T) \le 1$. That is, we select feasible actions such that $(\lambda_{a0}(T) + \lambda_{a2}(T)) \times (1 - \mu'_a(T)) = \Gamma_a \times T - \rho_a$.

Next, we show that

$$E_\pi\Big\{\sum_{t=1}^{T} R_S(t) \,\Big|\, \Lambda(1)\Big\} \le E_{\pi'}\Big\{\sum_{t=1}^{T} R_S(t) \,\Big|\, \Lambda(1)\Big\}.$$

Since the decision functions under policies $\pi$ and $\pi'$ are the same from slot $t = 1$ to slot $t = T - 1$, the following equation holds for the SU's expected throughput from slot $t = 1$ to slot $t = T - 1$:

$$E_\pi\Big\{\sum_{t=1}^{T-1} R_S(t) \,\Big|\, \Lambda(1)\Big\} = E_{\pi'}\Big\{\sum_{t=1}^{T-1} R_S(t) \,\Big|\, \Lambda(1)\Big\}. \tag{46}$$

Thus, to compare the SU's expected throughput over $T$ slots under policies $\pi$ and $\pi'$, we only need to compare the SU's expected rewards in the last slot $t = T$. It is easy to find that the belief states $\Lambda(T)$ in slot $T$ under both policies are the same. In the following, we compute an upper bound on the SU's expected reward in slot $t = T$ under the policy $\pi$, which is denoted as $E^U_\pi\{R_S(T)|\Lambda(T)\}$ with $E^U_\pi\{R_S(T)|\Lambda(T)\} \ge E_\pi\{R_S(T)|\Lambda(T)\}$, and show that $E_{\pi'}\{R_S(T)|\Lambda(T)\} \ge E^U_\pi\{R_S(T)|\Lambda(T)\}$. Denote the SU's actions in slot $t = T$ that achieve $E^U_\pi\{R_S(T)|\Lambda(T)\}$ under the constraint given in (45) by $\delta^U_a(T)$, $\epsilon^U_a(T)$, $f^U_a(0,T)$, and $f^U_a(1,T)$. To find these actions, we need to solve an optimization problem, which is to maximize $(\lambda_{a1}(T) + \lambda_{a3}(T)) \times g^U_a(T)$ subject to $\mu^U_a(T) = 1 - \frac{u_T}{\lambda_{a0}(T) + \lambda_{a2}(T)}$. According to Lemma A.1, it is easy to find that $(\delta^U_a(T), \epsilon^U_a(T))$ is on the optimal ROC curve with $\delta^U_a(T) = 1 - \frac{u_T}{\lambda_{a0}(T) + \lambda_{a2}(T)}$, $f^U_a(0,T) = 0$, and $f^U_a(1,T) = 1$. Since $u_T \ge \Gamma_a \times T - \rho_a$, we have $\delta'_a(T) \ge \delta^U_a(T)$. Correspondingly, $\epsilon'_a(T) \le \epsilon^U_a(T)$. From (4), it is obvious that $g'_a(T) \ge g^U_a(T)$. Thus, given $\Lambda(T)$ and from (HB, we have $E_{\pi'}\{R_S(T)|\Lambda(T)\} \ge E^U_\pi\{R_S(T)|\Lambda(T)\} \ge E_\pi\{R_S(T)|\Lambda(T)\}$. From (46), we have $E_{\pi'}\{\sum_{t=1}^{T} R_S(t)|\Lambda(1)\} \ge E_\pi\{\sum_{t=1}^{T} R_S(t)|\Lambda(1)\}$. Proposition 5.1 is thus proved.
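
The last-slot construction of the modified policy can be checked with a few lines of arithmetic. The sketch below assumes that the collision probability has the form $\mu = (1 - \delta) f(0) + \delta f(1)$, so that it equals $\delta$ when $f(0) = 0$ and $f(1) = 1$; this form is an assumption, since (7) is not restated in this appendix. The function names and the numerical values (benchmark 0.8, horizon 5, accumulated PU throughput 3.3, busy belief 0.85) are hypothetical.

```python
def last_slot_miss_detection(benchmark, horizon, accumulated, busy_belief):
    """Miss-detection probability that spends exactly the remaining PU budget.

    busy_belief stands for lambda_a0(T) + lambda_a2(T); accumulated stands for
    the PU's throughput over slots 1..T-1 (rho_a in the proof).
    """
    return 1.0 - (benchmark * horizon - accumulated) / busy_belief


def pu_last_slot_throughput(busy_belief, delta, f0=0.0, f1=1.0):
    # Assumed collision probability: mu = (1 - delta) * f0 + delta * f1.
    mu = (1.0 - delta) * f0 + delta * f1
    return busy_belief * (1.0 - mu)


benchmark, horizon, accumulated, busy_belief = 0.8, 5, 3.3, 0.85
delta_prime = last_slot_miss_detection(benchmark, horizon, accumulated, busy_belief)
remaining = benchmark * horizon - accumulated
print(round(delta_prime, 4), round(pu_last_slot_throughput(busy_belief, delta_prime), 4))
```

With these numbers the PU's last-slot throughput equals the remaining budget $\Gamma_a \times T - \rho_a$, so the long-term constraint is met with equality, which is the property the proof exploits.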



Appendix D
Proof of Proposition 5.2

Suppose that $\pi_\delta^*$ and $\pi_c^*$ are the SU's optimal policies for (P2-S-1). Denote $c^*(t)$, $1 \le t \le T$, as the resultant PU's throughput in slot $t$ under $\pi_\delta^*$ and $\pi_c^*$. That is, given $\Lambda(t)$ in slot $t$, we have $E_{\pi_\delta^*, \pi_c^*}\{R_{P,a}(t)|\Lambda(t)\} = c^*(t)$ and $\sum_{t=1}^{T} c^*(t) = \Gamma_a \times T$. Thus, if we have found $c^*(t)$, $1 \le t \le T$, we can adopt the following short-term protection for the PU's transmission for (P2-S-1), without loss of optimality:

$$E_{\pi_\delta, \pi_c}\{R_{P,a}(t)|\Lambda(t)\} = c^*(t), \quad \forall t \in \{1, \ldots, T\}. \tag{47}$$

Then from ( ISTb and (47), under the availability assumption of $c^*(t)$, (P2-S-1) is equivalent to

$$\text{(D-1)}: \quad \max_{\pi_\delta, \pi_c} E_{\pi_\delta, \pi_c}\Big\{\sum_{t=1}^{T} R_S(t) \,\Big|\, \Lambda(1)\Big\} \quad \text{s.t.} \quad \sigma_a(t) = 1 - \frac{c^*(t)}{P\{I(a,t) = 0\}}, \quad \forall t \in \{1, \ldots, T\}.$$

Since the formulation is similar to (P1) with $N = 1$, from Appendix A, the optimal solutions for (D-1) are

$$\begin{aligned} &\delta_a^*(t) = 1 - \frac{c^*(t)}{P\{I(a,t) = 0\}}, \\ &\epsilon_a^*(t) \text{ is on the optimal ROC curve corresponding to } \delta_a^*(t), \\ &f_a^*(0,t) = 0, \\ &f_a^*(1,t) = 1. \end{aligned} \tag{48}$$

Since (D-1) is equivalent to (P2-S-1), (48) is also the optimal solution for (P2-S-1). Furthermore, with a proof similar to that in Appendix B, it is easy to show that if $\frac{c^*(t)}{P\{I(a,t) = 0\}}$ is a constant over time $t$, the resultant PU's throughput over $T$ slots will be smaller than $\Gamma_a \times T$, which is contrary to the fact that $\sum_{t=1}^{T} c^*(t) = \Gamma_a \times T$. Thus, $\frac{c^*(t)}{P\{I(a,t) = 0\}}$ is not a constant over time $t$. Hence, the optimal PM decision $\delta^*(t)$ is time-varying and needs to be adaptively selected based on $\Lambda(t)$ over time. Proposition 5.2 is thus proved.



Appendix E
Proof of Proposition 5.3

Firstly, we introduce some new notations for the POMDP. According to the complexity analysis for (P2-S-2) in Section V-A, given $\Lambda(1)$ and the SU's POMDP policy $\pi_\delta$ for (P2-S-2), $2^{t-1}$ possible belief states could occur with non-zero probability in slot $t$, $\forall t \in \{1, \ldots, T\}$. We use $\Lambda^b(t) = (\lambda^b_{a0}(t), \lambda^b_{a1}(t), \lambda^b_{a2}(t), \lambda^b_{a3}(t))$, $b \in \{1, \ldots, 2^{t-1}\}$, to denote these possible belief states in slot $t$ and use $h^b(t)$ to denote the occurrence probability of the belief state $\Lambda^b(t)$, where $h^b(t)$ is determined by the SU's action decision history and observation history in the previous $t - 1$ slots and $\sum_{b=1}^{2^{t-1}} h^b(t) = 1$. We denote the SU's PM decision on channel $a$ for belief state $\Lambda^b(t)$ as $\delta^b_a(t)$. Next, under the MDP policy $\pi^M_\delta$ and the POMDP policy $\pi_\delta$, we give the following lemma, based on which Proposition 5.3 can be proved.

Lemma E.1: Given $\Lambda(1) = \Omega(1)$, the relationship between the POMDP belief states and the MDP state in each slot is

$$\omega_j(t) = \sum_{b=1}^{2^{t-1}} h^b(t) \times \lambda^b_{aj}(t), \quad \forall j \in C_S, \ \forall t \in \{1, \ldots, T\}. \tag{49}$$

Proof: We use mathematical induction to prove this lemma. Since $\Omega(1) = \Lambda(1)$, (49) holds when $t = 1$. Suppose (49) holds in slot $t$, $t \ge 1$. By applying ^ to compute $h^b(t+1)$ and applying (8) and ^} to update the POMDP belief state and the MDP state, respectively, after some algebra, we find that (49) still holds in slot $t + 1$. Lemma E.1 is thus proved.
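
The relationship in Lemma E.1 is an instance of a general fact: averaging the branched belief states with their occurrence probabilities recovers a deterministically propagated state distribution. The sketch below illustrates this on a toy hidden Markov chain and is not the paper's model. In particular, the transition matrix is hypothetical and does not depend on the SU's actions, the sensor is described by fixed miss-detection and false-alarm probabilities (0.05 and 0.2, hypothetical), and states 0 and 2 are treated as busy while states 1 and 3 are treated as idle.

```python
def enumerate_beliefs(prior, transition, likelihood, horizon):
    """List, for each slot, all (occurrence probability, belief) pairs."""
    n_states = len(prior)
    n_obs = len(likelihood[0])
    slots = [[(1.0, list(prior))]]
    for _ in range(horizon - 1):
        nxt = []
        for h, lam in slots[-1]:
            for k in range(n_obs):
                # joint[i] = P(state i, observation k | current belief)
                joint = [lam[i] * likelihood[i][k] for i in range(n_states)]
                p_obs = sum(joint)
                posterior = [x / p_obs for x in joint]
                predicted = [
                    sum(posterior[i] * transition[i][j] for i in range(n_states))
                    for j in range(n_states)
                ]
                nxt.append((h * p_obs, predicted))
        slots.append(nxt)
    return slots


def mixture(slot):
    """Occurrence-probability-weighted average of the beliefs in one slot."""
    n_states = len(slot[0][1])
    return [sum(h * lam[j] for h, lam in slot) for j in range(n_states)]


def propagate(prior, transition, horizon):
    """Deterministic propagation of the state distribution (toy MDP state)."""
    dists = [list(prior)]
    for _ in range(horizon - 1):
        cur = dists[-1]
        dists.append([
            sum(cur[i] * transition[i][j] for i in range(len(cur)))
            for j in range(len(cur))
        ])
    return dists


# Hypothetical toy model: 4 states, 2 sensing outcomes (0 = busy, 1 = idle).
prior = [0.5, 0.3, 0.1, 0.1]
transition = [
    [0.90, 0.10, 0.00, 0.00],
    [0.40, 0.60, 0.00, 0.00],
    [0.00, 0.00, 0.10, 0.90],
    [0.00, 0.00, 0.05, 0.95],
]
likelihood = [[0.95, 0.05], [0.2, 0.8], [0.95, 0.05], [0.2, 0.8]]

slots = enumerate_beliefs(prior, transition, likelihood, horizon=5)
states = propagate(prior, transition, horizon=5)
gaps = [
    max(abs(m - s) for m, s in zip(mixture(slot), state))
    for slot, state in zip(slots, states)
]
print([len(slot) for slot in slots], max(gaps) < 1e-12)
```

The number of belief states doubles in every slot, matching the $2^{t-1}$ count used in the proof, while the weighted average coincides with a single deterministically updated vector.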



We now prove Proposition 5.3. By computing over all the possible belief states in slot $t$, the PU's throughput under the POMDP policy $\pi_\delta$ in slot $t$ is

$$E_{\pi_\delta}\{R_{P,a}(t)|\Lambda(1)\} = \sum_{b=1}^{2^{t-1}} h^b(t) \times (\lambda^b_{a0}(t) + \lambda^b_{a2}(t)) \times (1 - \delta^b_a(t)). \tag{50}$$

Under the MDP policy $\pi^M_\delta$, suppose $\sum_{t=1}^{T} R^M_{P,a}(t)/T = \Gamma_a$ is satisfied. Then under the POMDP policy $\pi_\delta$, from Lemma E.1 we find that (50) is equal to (28) under $\pi^M_\delta$, i.e., $E_{\pi_\delta}\{R_{P,a}(t)|\Lambda(1)\} = R^M_{P,a}(t)$ in each slot $t$. As a result, by summing the PU's throughput over all $T$ slots, we have $E_{\pi_\delta}\{\sum_{t=1}^{T} R_{P,a}(t)|\Lambda(1)\}/T = \Gamma_a$. Proposition 5.3 thus follows.

Appendix F
Proof of Proposition 5.4

The proof is based on mathematical induction and by computing backward in time. It is easy to obtain that (34) holds in slot $t = T$. Now suppose (34) holds in slot $t + 1 \le T$ with $m_j(t+1)$, $j \in C_S$, given in (35), if $t < T - 1$, or in (36), if $t = T - 1$. Then the following inequality must hold; otherwise, $\delta^M(t+1)$ can be shown to be negative:

$$X_P(t+1) \le (\omega_0(t+1) + \omega_2(t+1)) \times m_1(t+1) + \omega_1(t+1) \times m_2(t+1) + \omega_3(t+1) \times m_3(t+1).$$

By substituting (27) and dSTT i into the above inequality, after some algebra, we find that (34) still holds in slot $t$ with $m_j(t)$, $j \in C_S$, given in (35). Proposition 5.4 is thus proved.